python, beautifulsoup, lxml

How to get the index of a text node in BeautifulSoup?


How can I get the source index of a text node in an HTML string?

Tags have sourceline and sourcepos, which are useful for this, but NavigableString does not have any directly helpful properties like that (as far as I can find).

I've thought about using:

def get_index(text_node: NavigableString) -> int:
    return text_node.next_element.sourcepos - len(text_node)

But this does not work reliably, because the length of the closing tag is unpredictable. For example:

>>> get_index(BeautifulSoup('<p>hello</p><br>', 'html.parser').find(text=True))
7

This is incorrect (the expected index is 3), because the 4 characters of '</p>' are not accounted for. '<p>hello</p >' is also valid HTML and would produce an even more incorrect result. I'm not sure how to handle this kind of case with the tools I've found so far in BeautifulSoup.

I would also be interested in an lxml or Python html module answer if they have simple solutions.

Desired results:

>>> get_index(BeautifulSoup('hello', 'html.parser').find(text=True))
0
>>> get_index(BeautifulSoup('<p>hello</p><br>', 'html.parser').find(text=True))
3
>>> get_index(BeautifulSoup('<!-- hi -->hello', 'html.parser').find(text=True))
11
>>> get_index(BeautifulSoup('<p></p ><p >hello<br>there</p>', 'html.parser').find(text=True))
12
>>> get_index(BeautifulSoup('<p></p ><p >hello<br>there</p>', 'html.parser').find_all(string=True)[1])
21

Solution

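  • Using html.parser: inside handle_data, getpos() gives the (line, column) at which the text starts, and that can be converted into an absolute index: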

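    from html.parser import HTMLParser

    html_string = '<p></p ><p >hello<br>there</p>'  # example input

    class MyHTMLParser(HTMLParser):
        def handle_data(self, data: str):
            # getpos() gives the 1-based line and 0-based column where data starts
            line, col = self.getpos()
            # the parser counts only '\n' as a line break, so split on '\n' as well
            previous_lines = html_string.split('\n')[:line - 1]
            index = sum(len(s) + 1 for s in previous_lines) + col
            print(data, 'at', index)

    parser = MyHTMLParser()
    parser.feed(html_string)
    parser.close()  # flush any text still buffered at the end of the input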
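
If you need the offsets as values rather than printed output, the same getpos() idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: the names TextIndexParser and text_indices are made up for illustration, the whole document is available as one string, and the parser is left at its defaults (so character references in the returned text are already unescaped, and the text length may therefore differ from the length of the source span).

```python
from html.parser import HTMLParser


class TextIndexParser(HTMLParser):
    """Hypothetical helper: records (index, text) for each text chunk."""

    def __init__(self, source: str):
        super().__init__()
        # Absolute offset at which each line starts. Only '\n' is treated
        # as a line break here, to match how the parser counts lines.
        self._line_starts = [0]
        for i, ch in enumerate(source):
            if ch == '\n':
                self._line_starts.append(i + 1)
        self.found = []

    def handle_data(self, data: str) -> None:
        # getpos() is (1-based line, 0-based column) of the start of data.
        line, col = self.getpos()
        self.found.append((self._line_starts[line - 1] + col, data))


def text_indices(source: str):
    """Return a list of (index, text) pairs for the text in source."""
    parser = TextIndexParser(source)
    parser.feed(source)
    parser.close()  # flush text that may still be buffered at end of input
    return parser.found
```

With this sketch, text_indices('<p></p ><p >hello<br>there</p>') returns [(12, 'hello'), (21, 'there')], which matches the desired results in the question. Whitespace-only text between tags is reported as well, so filter those entries out if you only want visible text.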