I am using BeautifulSoup to extract information from HTML files. I would like to be able to capture the location of the information, that is, the offset within the HTML file of the tag that a BS tag object represents.
Is there a way to do this?
I am currently using the lxml parser as it is the default.
If I'm reading your question correctly, you are parsing some HTML with BeautifulSoup and then using the soup to identify a tag. Once you have the tag, you are trying to find the index position of that tag within the original HTML string.
The problem with capturing the index position of a tag using BeautifulSoup is that the soup does not keep the original markup: the parser rebuilds the document as a tree and may alter its structure along the way. With lxml, the result is not guaranteed to be a character-for-character copy of the input, so the string form of a tag found in the soup may not appear verbatim in the original HTML.
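To make that concrete, here is a minimal sketch of how a serialized tag can differ from its source. The input string is made up for illustration, and I'm using `html.parser` here only so the snippet runs without lxml installed; that choice is my assumption, not something from your setup. lxml normalizes markup in a similar way and additionally wraps fragments in `<html><body>`.

```python
from bs4 import BeautifulSoup

# Hypothetical input: uppercase tag name and an unquoted attribute value.
html = "<P class=intro>Hello</P>"

# "html.parser" is an assumption made so the sketch has no extra dependency.
soup = BeautifulSoup(html, "html.parser")
target = soup.find("p")

# The tag is serialized from the parse tree, not copied from the source,
# so the tag name is lowercased and the attribute value gets quoted.
serialized = str(target)

# Searching the original document for the serialized tag therefore fails.
position = html.find(serialized)  # -1
```

So searching for `str(target)` in the original document is not something you can rely on.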
It's iffy whether this will work consistently, but you might try using the string's find method to locate your tag's text contents, which should remain largely unchanged.
from bs4 import BeautifulSoup

# html is a string containing your html document
soup = BeautifulSoup(html, 'lxml')
# target is the tag you want to find
target = soup.find('p')
# now we locate the text of the target inside of the html document;
# str.find returns -1 if the text does not appear verbatim
offset = html.find(target.text)
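This method will not give you the position where the tag itself begins, but it should be able to locate the tag's contents within the HTML. Keep in mind that find returns the first occurrence, so repeated text can match an earlier spot in the document, and it returns -1 if the text is not found verbatim (for example, when the source uses character entities that the parser decodes).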
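If you need something closer to the start of the tag, one possible extension is to search backwards from the text offset for the nearest opening `<name`. This is only a rough sketch: `approximate_tag_offset` is a hypothetical helper I made up, the sample document is invented, and it inherits every weakness of the text search (duplicate text, decoded entities, nested tags of the same name). Again I use `html.parser` as an assumption so the snippet runs without lxml.

```python
import re

from bs4 import BeautifulSoup


def approximate_tag_offset(html, target):
    """Best-guess offset of target's opening tag in html, or -1 (hypothetical helper)."""
    text_offset = html.find(target.text)
    if text_offset == -1:
        return -1
    # Match "<name" case-insensitively, since the source may use "<P" or "<p".
    pattern = re.compile(r"<" + re.escape(target.name) + r"\b", re.IGNORECASE)
    # Only look at the part of the document that precedes the text.
    matches = list(pattern.finditer(html, 0, text_offset))
    # The last match before the text is the closest opening tag.
    return matches[-1].start() if matches else -1


# Made-up document for illustration.
html = "<html><body><h1>Title</h1>\n<P class=intro>Hello there</P></body></html>"
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
target = soup.find("p")
offset = approximate_tag_offset(html, target)
```

In this example `html[offset:]` begins with `<P class=intro>`, but treat the result as a heuristic rather than a guaranteed position.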
If you wanted to know the index of a tag in the body of your soup, that would be much more feasible.
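I'm not sure which kind of index you would want, so here is a small sketch of two readings: the tag's position among its parent's children, and its position among all tags in document order. The sample document is invented and `html.parser` is my assumption. Note that I compare by identity (`is`), because two tags with the same name, attributes, and contents compare equal with `==`.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration.
html = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
target = soup.find_all("p")[1]

# Position among the parent's children; Tag.index looks the child up by identity.
sibling_index = target.parent.index(target)  # 2 (after <h1> and the first <p>)

# Position among all tags in document order, again compared by identity.
all_tags = soup.find_all(True)
document_index = next(i for i, tag in enumerate(all_tags) if tag is target)  # 4
```

These indices describe the parsed tree, not the original file, so they stay meaningful even when the parser has rewritten the markup.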