How can I get the source index of a text node in an HTML string?
Tags have sourceline and sourcepos, which are useful for this, but NavigableString does not have any directly helpful properties like that (as far as I can find).
I've thought about using:
from bs4 import BeautifulSoup, NavigableString

def get_index(text_node: NavigableString) -> int:
    return text_node.next_element.sourcepos - len(text_node)
But this will not work perfectly because the length of the closing tag is unpredictable, e.g.
>>> get_index(BeautifulSoup('<p>hello</p><br>', 'html.parser').find(text=True))
7
is incorrect (the text actually starts at index 3). '<p>hello</p >' is also valid HTML and would produce an even larger error, and I'm not sure how to handle this kind of case with the tools I've found so far in BeautifulSoup.
I would also be interested in an lxml or Python html module answer if they have simple solutions.
Desired results:
>>> get_index(BeautifulSoup('hello', 'html.parser').find(text=True))
0
>>> get_index(BeautifulSoup('<p>hello</p><br>', 'html.parser').find(text=True))
3
>>> get_index(BeautifulSoup('<!-- hi -->hello', 'html.parser').find(text=True))
11
>>> get_index(BeautifulSoup('<p></p ><p >hello<br>there</p>', 'html.parser').find(text=True))
12
>>> get_index(BeautifulSoup('<p></p ><p >hello<br>there</p>', 'html.parser').find_all(string=True)[1])
21
Using html.parser:
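from html.parser import HTMLParser

html_string = '<p></p ><p >hello<br>there</p>'  # example input from the question

class MyHTMLParser(HTMLParser):
    def handle_data(self, data: str):
        # getpos() returns (1-based line number, 0-based column offset)
        line, col = self.getpos()
        # Everything on the lines before the current one, line endings included
        previous_lines = ''.join(html_string.splitlines(True)[:line - 1])
        index = len(previous_lines) + col
        print(data, 'at', index)

parser = MyHTMLParser()
parser.feed(html_string)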
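If you need the indices back as values rather than printed, the handle_data idea above can be wrapped in a small helper. This is only a sketch: the names TextIndexParser and text_indices are made up for illustration and are not part of html.parser. It also precomputes the line start offsets by looking for '\n' only, on the assumption (from reading the HTMLParser source) that '\n' is the only character the parser treats as a line break when tracking getpos(); str.splitlines also splits on characters such as a lone '\r', which could throw the line count off.

```python
from html.parser import HTMLParser


class TextIndexParser(HTMLParser):
    """Hypothetical helper: records (text, absolute index) for each text node."""

    def __init__(self, html: str) -> None:
        super().__init__()
        # Offset at which each line starts, counting '\n' as the only line break.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == '\n']
        self.text_nodes: list[tuple[str, int]] = []
        self.feed(html)
        self.close()  # flush any text still buffered at the end of the input

    def handle_data(self, data: str) -> None:
        # getpos() is (1-based line, 0-based column) of the start of this text
        line, col = self.getpos()
        self.text_nodes.append((data, self.line_starts[line - 1] + col))


def text_indices(html: str) -> list[tuple[str, int]]:
    """Return every text node in html together with its source index."""
    return TextIndexParser(html).text_nodes


print(text_indices('<p></p ><p >hello<br>there</p>'))
```

Note that this walks the raw source with html.parser, so the text nodes it reports are html.parser's own chunks; matching them back to specific NavigableString objects in a BeautifulSoup tree would be a separate step.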