The html is as follows:
<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>
I'm trying to get all the divs and cast them into strings:
divs = [str(i) for i in soup.find_all('div')]
However, they'll have their children too:
>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]
What I'd like it to be is:
>>> ["<div name='tag-i-want'></div>"]
I figured there is unwrap()
which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.
With clear
you remove the tag's content.
Without altering the soup you can either do an hardcopy with copy
or use a DIY approach. Here an example with the copy
from bs4 import BeautifulSoup
import copy
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div')
div_only = copy.copy(div)
div_only.clear()
print(div_only)
print(soup.find_all('span') != [])
Output
<div name="tag-i-want"></div>
True
Remark: the DIY approach: without copy
Tag
classfrom bs4 import BeautifulSoup, Tag
html = """<body>
<div name='tag-i-want'>
<span>I don't want this</span>
</div>
</body>"""
soup = BeautifulSoup(html, 'html.parser')
# print the first div without it's children:
print(Tag(name=soup.div.name, attrs=soup.div.attrs))
# print all divs without it's children:
for i in soup.find_all("div"):
print(Tag(name=i.name, attrs=i.attrs))
div_only = '<div {}></div>'.format(' '.join(map(lambda p: f'{p[0]}="{p[1]}"', div.attrs.items())))