python beautifulsoup

How to have BeautifulSoup use my custom class instead of the bs4.element.Tag class that it uses to create the tree?


I am using BeautifulSoup4 to parse an HTML string into a structured object, and I want every HTML element (e.g. soup.body.title) to have an attribute called embed (e.g. soup.body.title.embed).

So I created child classes for Tag and BeautifulSoup with the embed attribute added, and while the type of the root node object is EmbedSoup, which is as I intended, the type of soup.body is bs4.element.Tag instead of EmbedTag.

How do I make sure that all elements of the BeautifulSoup Tree are of type EmbedTag and not bs4.element.Tag? Is there another solution to this problem that I am having?

from bs4 import BeautifulSoup, Tag

class EmbedTag(Tag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed = None  # Initialize the embed attribute to None

class EmbedSoup(BeautifulSoup):
    def __init__(self, *args, **kwargs):
        kwargs['element_classes'] = {'tag': EmbedTag}
        super().__init__(*args, **kwargs)

# Parse the HTML with the custom BeautifulSoup class
soup = EmbedSoup(html_content, 'html.parser')
print(type(soup))       # EmbedSoup, as intended
print(type(soup.body))  # bs4.element.Tag, but EmbedTag was expected

Solution

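  • According to the Beautiful Soup documentation, the element_classes dictionary must map Beautiful Soup classes to your replacement classes (type to type), not strings to classes (str to type). The parser looks up the Tag class itself as the key, so an entry under the string 'tag' is never matched and plain bs4.element.Tag objects are created. Use Tag as the key instead: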

    from bs4 import BeautifulSoup, Tag
    
    class EmbedSoup(BeautifulSoup):
        def __init__(self, *args, **kwargs):
            kwargs['element_classes'] = {Tag: EmbedTag}
            super().__init__(*args, **kwargs)
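
To check the fix end to end, here is a minimal self-contained sketch. The sample HTML string and the value stored in embed are made up for illustration, and it assumes a Beautiful Soup 4 release recent enough to accept the element_classes argument.

```python
from bs4 import BeautifulSoup, Tag


class EmbedTag(Tag):
    """Tag subclass that carries an extra `embed` attribute."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed = None  # placeholder until a value is assigned


class EmbedSoup(BeautifulSoup):
    def __init__(self, *args, **kwargs):
        # Key the mapping by the Tag class itself, not by a string.
        kwargs["element_classes"] = {Tag: EmbedTag}
        super().__init__(*args, **kwargs)


# Hypothetical sample input, used only for this demonstration.
html_content = "<html><body><p>Hello <b>world</b></p></body></html>"
soup = EmbedSoup(html_content, "html.parser")

# Collect the class name of every element the parser created.
element_types = {type(tag).__name__ for tag in soup.find_all(True)}
print(type(soup).__name__)
print(element_types)

# Each element now has its own embed attribute that can be set freely.
soup.body.embed = [0.1, 0.2]  # made-up value
print(soup.body.embed, soup.p.embed)
```

Note that only elements created by the parser become EmbedTag; the root object stays an EmbedSoup, so it does not go through EmbedTag.__init__.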