I have a html page, with one structure that I want to turn into Clojure data structure. I’m hitting a mental block on how to approach this in an idiomatic way
This is the structure I have:
<div class=“group”>
<h2>title1<h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading1</h3>
<a href=“path1” />
</div>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading2</h3>
<a href=“path2” />
</div>
</div>
<div class=“group”>
<h2>title2<h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading3</h3>
<a href=“path3” />
</div>
</div>
Structure I want:
'(
[“Title1” “subhead1” “path1”]
[“Title1” “subhead2” “path2”]
[“Title2” “subhead3” “path3”]
[“Title3” “subhead4” “path4”]
[“Title3” “subhead5” “path5”]
[“Title3” “subhead6” “path6”]
)
The repetition of titles is intentional.
I’ve read David Nolan’s enlive tutorial. That offers a good solution if there was a parity between group and subgroup, but in this case it can be random.
Thanks for any advice.
You can use Hickory for parsing, and then Clojure has some very nice tools for transforming the parsed HTML to the form you want:
(require '[hickory.core :as html])
(defn classifier [tag klass]
(comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))
(def group? (classifier :div "“group”"))
(def subgroup? (classifier :div "“subgroup”"))
(def path? (classifier :a nil))
(defn identifier? [tag] (classifier tag nil))
(defn only [x]
;; https://stackoverflow.com/a/14792289/5044950
{:pre [(seq x)
(nil? (next x))]}
(first x))
(defn identifier [tag element]
(->> element :content (filter (identifier? tag)) only :content only))
(defn process [data]
(for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
:let [title (identifier :h2 group)]
subgroup (filter subgroup? (:content group))
:let [subheading (identifier :h3 subgroup)]
path (filter path? (:content subgroup))]
[title subheading (:href (:attrs path))]))
Example:
(require '[clojure.pprint :as pprint])
(def data
"<div class=“group”>
<h2>title1</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading1</h3>
<a href=“path1” />
</div>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading2</h3>
<a href=“path2” />
</div>
</div>
<div class=“group”>
<h2>title2</h2>
<div class=“subgroup”>
<p>unused</p>
<h3>subheading3</h3>
<a href=“path3” />
</div>
</div>")
(pprint/pprint (process data))
;; (["title1" "subheading1" "“path1”"]
;; ["title1" "subheading2" "“path2”"]
;; ["title2" "subheading3" "“path3”"])