I would like to get an HTML string without certain elements. However, I only know upfront which elements to keep, not which ones to drop.
Let's say I just want to keep all p and a tags inside the div with class="A".
Input:
<div class="A">
<p>Text1</p>
<img src="A.jpg">
<div class="sub1">
<p>Subtext1</p>
</div>
<p>Text2</p>
<a href="url">link text</a>
</div>
<div class="B">
ContentDiv2
</div>
Expected output:
<div class="A">
<p>Text1</p>
<p>Text2</p>
<a href="url">link text</a>
</div>
If I knew the selectors of all the other elements, I could just use lxml's drop_tree(). But the problem is that I don't know ['img', 'div.sub1', 'div.B'] upfront.
Example with drop_tree():
import lxml.cssselect
import lxml.html

tree = lxml.html.fromstring(html_str)
elements_drop = ['img', 'div.sub1', 'div.B']
for j in elements_drop:
    selector = lxml.cssselect.CSSSelector(j)
    for e in selector(tree):
        e.drop_tree()
output = lxml.html.tostring(tree)
I'm still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:
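target = tree.xpath('//div[@class="A"]')[0]
# The leading dot makes the paths relative to target instead of the whole document
to_keep = target.xpath('./p | ./a')
# Loop over direct children only: remove() raises ValueError for deeper descendants
for t in target.xpath('./*'):
    if t not in to_keep:
        target.remove(t)  # I believe this method is better here than drop_tree()
print(lxml.html.tostring(target).decode())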
The output I get is your expected output.