Tags: python, web-scraping, beautifulsoup

Using BeautifulSoup, how to select a tag without its children?


The html is as follows:

<body>
    <div name='tag-i-want'>
        <span>I don't want this</span>
    </div>
</body>

I'm trying to get all the divs and cast them into strings:

divs = [str(i) for i in soup.find_all('div')]

However, they'll have their children too:

>>> ["<div name='tag-i-want'><span>I don't want this</span></div>"]

What I'd like it to be is:

>>> ["<div name='tag-i-want'></div>"]

I figured there is unwrap() which would return this, but it modifies the soup as well; I'd like the soup to remain untouched.


Solution

  • With clear() you remove a tag's contents. To leave the soup untouched, you can either work on a hard copy made with copy.copy, or use a DIY approach and build the tag anew. Here is an example with copy:

    from bs4 import BeautifulSoup
    import copy
    
    html = """<body>
        <div name='tag-i-want'>
            <span>I don't want this</span>
        </div>
    </body>"""
    
    soup = BeautifulSoup(html, 'lxml')
    div = soup.find('div')
    
    div_only = copy.copy(div)
    div_only.clear()
    
    
    print(div_only)
    print(soup.find_all('span') != [])
    

    Output

    <div name="tag-i-want"></div>
    True
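
Applied to the question's original list comprehension, the copy-then-clear step can be wrapped in a small helper so every div is stringified without its children (the helper name `shallow` is mine, not part of bs4):

```python
from bs4 import BeautifulSoup
import copy

html = """<body>
    <div name='tag-i-want'>
        <span>I don't want this</span>
    </div>
</body>"""

soup = BeautifulSoup(html, 'html.parser')

def shallow(tag):
    # copy.copy gives a detached duplicate of the tag; clearing
    # that duplicate leaves the original soup untouched
    tag_copy = copy.copy(tag)
    tag_copy.clear()
    return tag_copy

divs = [str(shallow(d)) for d in soup.find_all('div')]
print(divs)                           # ['<div name="tag-i-want"></div>']
print(soup.find('span') is not None)  # True -- the span is still in the soup
```

This keeps the shape of the asker's `[str(i) for i in soup.find_all('div')]` while swapping each tag for its emptied copy.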
    

    Remark: the DIY approach, without copy:

    from bs4 import BeautifulSoup, Tag
    html = """<body>
        <div name='tag-i-want'>
            <span>I don't want this</span>
        </div>
    </body>"""
    soup = BeautifulSoup(html, 'html.parser')
    
    # print the first div without its children:
    print(Tag(name=soup.div.name, attrs=soup.div.attrs))
    
    # print all divs without their children:
    for i in soup.find_all("div"):
        print(Tag(name=i.name, attrs=i.attrs))
    
    # or build the string by hand from the tag's name and attributes:
    div_only = '<div {}></div>'.format(' '.join(f'{k}="{v}"' for k, v in soup.div.attrs.items()))
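
One caveat with building the string by hand: bs4 stores multi-valued attributes such as class as Python lists, so naive formatting leaks the list repr, while a fresh Tag serializes them back to a space-joined string. A small sketch (the sample markup here is my own, not from the question):

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup('<div class="a b" id="x"><span>child</span></div>',
                     'html.parser')

# naive formatting exposes the list representation of `class`
manual = ' '.join(f'{k}="{v}"' for k, v in soup.div.attrs.items())
print(manual)  # class="['a', 'b']" id="x"

# a fresh Tag joins multi-valued attributes back into a string
print(Tag(name=soup.div.name, attrs=soup.div.attrs))
```

So for attribute-heavy markup the Tag-based DIY variant is the safer of the two.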