html clojure enlive

Turning an HTML structure into a Clojure data structure


I have an HTML page containing a structure that I want to turn into a Clojure data structure. I’m hitting a mental block on how to approach this in an idiomatic way.

This is the structure I have:

<div class="group">
  <h2>title1</h2>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading1</h3>
    <a href="path1" />
  </div>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading2</h3>
    <a href="path2" />
  </div>
</div>
<div class="group">
  <h2>title2</h2>
  <div class="subgroup">
    <p>unused</p>
    <h3>subheading3</h3>
    <a href="path3" />
  </div>
</div>

Structure I want:

'(
["Title1" "subhead1" "path1"]
["Title1" "subhead2" "path2"]
["Title2" "subhead3" "path3"]
["Title3" "subhead4" "path4"]
["Title3" "subhead5" "path5"]
["Title3" "subhead6" "path6"]
)

The repetition of titles is intentional.

I’ve read David Nolen’s Enlive tutorial. It offers a good solution when there is a one-to-one correspondence between groups and subgroups, but in this case the number of subgroups per group varies.

Thanks for any advice.


Solution

  • You can parse the HTML with Hickory, and then use Clojure's sequence tools (such as filter, juxt and the for comprehension) to transform the parsed tree into the form you want:

    (require '[hickory.core :as html])
    
    (defn classifier [tag klass]
      (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))
    
    ;; Predicates for the elements of interest. (classifier :a nil) matches
    ;; an <a> element that has no class attribute.
    (def group? (classifier :div "group"))
    (def subgroup? (classifier :div "subgroup"))
    (def path? (classifier :a nil))
    (defn identifier? [tag] (classifier tag nil))
    
    (defn only [x]
      ;; https://stackoverflow.com/a/14792289/5044950
      ;; Returns the single element of x; the precondition fails unless
      ;; x contains exactly one element.
      {:pre [(seq x)
             (nil? (next x))]}
      (first x))
    
    ;; Text of the single child element with the given tag.
    (defn identifier [tag element]
      (->> element :content (filter (identifier? tag)) only :content only))
    
    ;; One [title subheading path] row per link; the title repeats for
    ;; every subgroup in its group.
    (defn process [data]
      (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
            :let [title (identifier :h2 group)]
            subgroup (filter subgroup? (:content group))
            :let [subheading (identifier :h3 subgroup)]
            path (filter path? (:content subgroup))]
        [title subheading (:href (:attrs path))]))
    

    Example:

    (require '[clojure.pprint :as pprint])
    
    ;; Attributes are single-quoted so they need no escaping inside the
    ;; Clojure string.
    (def data
    "<div class='group'>
      <h2>title1</h2>
      <div class='subgroup'>
        <p>unused</p>
        <h3>subheading1</h3>
        <a href='path1' />
      </div>
      <div class='subgroup'>
        <p>unused</p>
        <h3>subheading2</h3>
        <a href='path2' />
      </div>
    </div>
    <div class='group'>
      <h2>title2</h2>
      <div class='subgroup'>
        <p>unused</p>
        <h3>subheading3</h3>
        <a href='path3' />
      </div>
    </div>")
    
    (pprint/pprint (process data))
    ;; (["title1" "subheading1" "path1"]
    ;;  ["title1" "subheading2" "path2"]
    ;;  ["title2" "subheading3" "path3"])
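
To see why the nested for copes with a varying number of subgroups per group, here is a minimal sketch that applies the same idea to hand-written maps, with no parsing library involved. The `el` and `child-text` helpers are hypothetical names introduced only for this illustration, and the map shape (`:type`, `:tag`, `:attrs`, `:content`) is an assumption about what the parser produces for the example input, with whitespace text nodes left out.

```clojure
;; Hypothetical helper: builds an element map by hand, in the shape the
;; parsed tree is assumed to have.
(defn el [tag attrs & content]
  {:type :element :tag tag :attrs attrs :content (vec content)})

(def groups
  [(el :div {:class "group"}
       (el :h2 nil "title1")
       (el :div {:class "subgroup"}
           (el :h3 nil "subheading1")
           (el :a {:href "path1"}))
       (el :div {:class "subgroup"}
           (el :h3 nil "subheading2")
           (el :a {:href "path2"})))
   (el :div {:class "group"}
       (el :h2 nil "title2")
       (el :div {:class "subgroup"}
           (el :h3 nil "subheading3")
           (el :a {:href "path3"})))])

;; Same predicate builder as above: juxt extracts [type tag class], and the
;; set, used as a function, returns that triple on a match and nil otherwise.
(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

;; Hypothetical helper: text of the first child element with the given tag.
(defn child-text [tag element]
  (->> (:content element) (filter (classifier tag nil)) first :content first))

;; Each binding in the for iterates inside the previous one, so a group with
;; two subgroups yields two rows and a group with one subgroup yields one.
(def rows
  (for [group    (filter (classifier :div "group") groups)
        subgroup (filter (classifier :div "subgroup") (:content group))
        link     (filter (classifier :a nil) (:content subgroup))]
    [(child-text :h2 group)
     (child-text :h3 subgroup)
     (get-in link [:attrs :href])]))

(prn rows)
```

Because the title is looked up once per group and reused for every inner iteration, the repetition of titles falls out of the comprehension without any bookkeeping.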