Attempting to get the text from a certain part of a website using lxml.html


I have some Python code that is supposed to get the text from a certain part of a web page, using the XPath of the HTML tag that contains it.

def wordorigins(word):
    pageopen = lxml.html.fromstring("http://www.merriam-webster.com/dictionary/" + str(word))
    pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]")
    etybody = lxml.html.fromstring(pbody)
    etytxt = etybody.xpath('text()')
    etytxt = etytxt.replace("<em>", "")
    etytxt = etytxt.replace("</em>", "")
    return etytxt

This code raises the following error about expecting a string or a buffer:

Traceback (most recent call last):
  File "mott.py", line 47, in <module>
    print wordorigins(x)
  File "mott.py", line 30, in wordorigins
    etybody = lxml.html.fromstring(pbody)
  File "/usr/lib/python2.7/site-packages/lxml/html/__init__.py", line 866, in fromstring
    is_full_html = _looks_like_full_html_unicode(html)
TypeError: expected string or buffer

Thoughts?


Solution

  • The xpath() method returns a list of results, but fromstring() expects a string.

    But you don't need to reparse that part of the document. Just use the element you've already found:

    import lxml.html

    def wordorigins(word):
        url = "http://www.merriam-webster.com/dictionary/" + str(word)
        # fromstring() parses markup and does not fetch URLs; parse() loads the page.
        pageopen = lxml.html.parse(url).getroot()
        # xpath() returns a list, so take the first match.
        pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]")[0]
        # text_content() returns text without markup, so no <em> stripping is needed.
        etytxt = pbody.text_content()
        return etytxt
    

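    Note that I'm using the text_content() method instead of xpath("text()"). It returns a single string containing the element's text together with the text of its nested tags, with no markup, so there is nothing left to strip with replace().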
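
To make the difference concrete, here is a minimal sketch that runs offline. The SAMPLE markup, the "etymology" class name and the etymology_text() helper are made up for illustration and are not the real Merriam-Webster page structure. It shows that xpath() hands back a list (possibly empty), that xpath("text()") yields separate text fragments, and that text_content() joins everything into one string.

```python
import lxml.html

# Hypothetical stand-in for a fetched page; NOT the real Merriam-Webster markup.
SAMPLE = """
<html><body>
  <div class="etymology">
    <p>Middle English, from Latin <em>origo</em>, from <em>oriri</em> to rise</p>
  </div>
</body></html>
"""


def etymology_text(html):
    """Return the text of the first etymology paragraph, or None if absent."""
    root = lxml.html.fromstring(html)
    # xpath() returns a list of matching elements, which may be empty.
    matches = root.xpath("//div[@class='etymology']/p[1]")
    if not matches:
        return None
    # text_content() includes the text of nested tags such as <em>, without markup.
    return matches[0].text_content()


root = lxml.html.fromstring(SAMPLE)
paragraph = root.xpath("//div[@class='etymology']/p[1]")[0]

# text() selects only the text nodes that are direct children of <p>,
# so the words inside <em> are missing and the result is a list of pieces.
direct_text = paragraph.xpath("text()")

full_text = etymology_text(SAMPLE)
missing = etymology_text("<html><body><p>no etymology here</p></body></html>")
```

Checking for an empty list before indexing avoids an IndexError when the page layout changes, which is likely with a long absolute XPath like the one in the question.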