Tags: python, html, lxml, lxml.html

How to get text from an HTML element using lxml.html


I've been trying to get the full text contained inside a <div> element on the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж", and it does, because my code

for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)

prints

<Element div at 0x15480d93ac8>


But when I try to get the text itself with div.text, it returns None.
That seems like a strange result to me. What should I do?
Any help would be greatly appreciated, as would a recommendation for a resource on HTML basics (I'm not a savvy programmer), so I can avoid such easy questions in the future.


Solution

  • This is one of those odd things that happen when XPath is handled through a host language and library. When you use the XPath expression

     .//div[contains(text(), "Арбитраж")] 
    

    the search is performed according to XPath rules, which consider the target text to be contained within the target div. When you go on to the next line:

    print(div.text)
    

    you are using lxml.html, which does not regard the target text as part of div.text, because it is preceded by an <i> tag: in lxml, an element's .text holds only the text that appears before its first child element, while text that follows a child is stored on that child's .tail. To get to it with lxml.html, you have to use:

    print(div.text_content())
    

    or with xpath only:

    print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
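
    To see why div.text comes back as None here, here is a minimal self-contained sketch (the HTML fragment below is made up to mimic the structure described, not copied from the actual page):

    import lxml.html

    # Made-up fragment: the target text follows an <i> child, so it lives
    # on that child's .tail rather than on div.text.
    div = lxml.html.fromstring('<div><i>Категория:</i> Арбитраж</div>')

    print(div.text)            # None - no text before the first child element
    print(div[0].tail)         # ' Арбитраж' - text that follows the <i>
    print(div.text_content())  # 'Категория: Арбитраж' - all descendant text joined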
    

    It seems lxml.etree and BeautifulSoup take different approaches; see this interesting discussion here.
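
    Putting it all together, a minimal end-to-end sketch (the question doesn't show how tree was built; fetching the page with requests is an assumption):

    import requests
    import lxml.html

    # Assumption: the tree is built by downloading and parsing the page.
    resp = requests.get('https://www.list-org.com/company/11665809')
    tree = lxml.html.fromstring(resp.content)  # pass bytes so lxml detects the encoding

    for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
        # text_content() gathers the text of the div and all of its descendants.
        print(div.text_content())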