haskell hxt

Extracting Values from a Subtree


I am parsing an XML file with HXT and trying to break some of the node extraction into modular pieces (I have been using this as my guide). Unfortunately, I cannot figure out how to apply selectors after the first level of parsing.

 import Text.XML.HXT.Core

 let node tag = multi (hasName tag)
 xml <- readFile "test.xml"
 let doc = readString [withValidate yes, withParseHTML no, withWarnings no] xml
 books <- runX $ doc >>> node "book"

I see that books has type [XmlTree]:

 :t books
 books :: [XmlTree]

Now I would like to get the first element of books and then extract some values inside the sub-tree.

 let b = head(books)
 runX $ b >>> node "cost"

Couldn't match type ‘Data.Tree.NTree.TypeDefs.NTree’
               with ‘IOSLA (XIOState ()) XmlTree’
Expected type: IOSLA (XIOState ()) XmlTree XNode
  Actual type: XmlTree
In the first argument of ‘(>>>)’, namely ‘b’
In the second argument of ‘($)’, namely ‘b >>> node "cost"’

I cannot find selectors that work once I have an XmlTree; the incorrect usage above is only meant to illustrate what I would like to do. I know I can do this:

 runX $ doc >>> node "book" >>> node "cost" /> getText
 ["55.9","95.0"]

But I am interested not only in cost but also in many more elements inside book. The XML file is pretty deep, so I don't want to nest everything with <+>. I would much rather extract the chunk I want and then extract its sub-elements in a separate function.

Example (made-up) XML File:

 <?xml version="1.0" encoding="UTF-8"?>
 <start xmlns="http://www.example.com/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
     <books> 
         <book>
             <author>
                 <name>
                     <first>Joe</first>
                     <last>Smith</last>
                 </name>
                 <city>New York City</city>
             </author>
             <released>1990-11-15</released>
             <isbn>1234567890</isbn>
             <publisher>X Publisher</publisher>
             <cost>55.9</cost>
         </book>
         <book>
             <author>
                 <name>
                     <first>Jane</first>
                     <last>Jones</last>
                 </name>
                 <city>San Francisco</city>
             </author>
             <released>1999-01-19</released>
             <isbn>0987654321</isbn>
             <publisher>Y Publisher</publisher>
             <cost>95.0</cost>
         </book>
     </books>
 </start>

Can someone help me understand how to extract the sub-elements of book? Ideally with something as nice as >>> and node, so that I can define my own functions such as getCost, getName, etc., each with roughly the signature XmlTree -> [String].


Solution

  • doc is not what you think it is: it has type IOStateArrow s b XmlTree, so it is an arrow rather than a parsed tree. It is worth rereading your guide; what you need is covered under the heading "Avoiding IO".

    Arrows are basically functions. A value of type SomeArrow a b can be thought of as a generalized function of type a -> b, and >>> and the related operators compose arrows much as (.) composes functions. Your books has type [XmlTree], so it is not an arrow and cannot be composed with arrows. What you need is runLA, which turns an arrow such as node "tag" into an ordinary function:

    module Main where
    
    import           Text.XML.HXT.Core
    
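    main :: IO ()
    main = do
      xml <- readFile "test.xml"
      let doc = readString [withValidate yes, withParseHTML no, withWarnings no] xml
      -- runX runs the IO arrow and collects every matching <book> subtree.
      books <- runX $ doc >>> node "book"
      -- runLA (node "cost" /> getText) :: XmlTree -> [String]
      -- (>>=) on lists applies that function to each book and concatenates.
      let costs = books >>= runLA (node "cost" /> getText)
      print costs
    
    -- Select every descendant element with the given name.
    node :: ArrowXml a => String -> a XmlTree XmlTree
    node tag = multi (hasName tag)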
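The same idea extends to the per-book accessors asked for in the question. The following is only a sketch: textOf, getIsbn, getAuthorName and the inline sample string are hypothetical names introduced here for illustration, and validation is switched off on the assumption that the inline sample carries no DTD. Each accessor is an ordinary function XmlTree -> [String] built by handing a pure LA arrow to runLA.

```haskell
module Main where

import Text.XML.HXT.Core

-- Select every descendant element with the given name (same helper as above).
node :: ArrowXml a => String -> a XmlTree XmlTree
node tag = multi (hasName tag)

-- Hypothetical helper: text children of every descendant element named `tag`.
textOf :: String -> XmlTree -> [String]
textOf tag = runLA (node tag /> getText)

getCost, getIsbn :: XmlTree -> [String]
getCost = textOf "cost"
getIsbn = textOf "isbn"

-- Hypothetical accessor: pairs the first and last name found under
-- author/name and joins them with a space.
getAuthorName :: XmlTree -> [String]
getAuthorName = runLA $
  node "author" >>> node "name"
    >>> ((node "first" /> getText) &&& (node "last" /> getText))
    >>> arr (\(firstName, lastName) -> firstName ++ " " ++ lastName)

-- Made-up inline sample, trimmed from the XML in the question.
sample :: String
sample = concat
  [ "<books>"
  , "<book><author><name><first>Joe</first><last>Smith</last></name></author>"
  , "<isbn>1234567890</isbn><cost>55.9</cost></book>"
  , "<book><author><name><first>Jane</first><last>Jones</last></name></author>"
  , "<isbn>0987654321</isbn><cost>95.0</cost></book>"
  , "</books>"
  ]

main :: IO ()
main = do
  -- Assumption: no DTD in the sample, so validation is disabled here.
  books <- runX $ readString [withValidate no, withWarnings no] sample >>> node "book"
  -- Each accessor is a plain function applied to one <book> subtree.
  mapM_ (\b -> print (getAuthorName b, getIsbn b, getCost b)) books
```

Because the accessors are plain functions, they can be mapped over books or combined with ordinary list and tuple code, with no further use of runX once the book subtrees have been extracted.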