python web-scraping lxml

Screen scraping with lxml in Python: extracting specific data


For the last several hours I've been trying to write a program that does what I thought would be an incredibly simple task:

  1. Program asks for user input (let's say they type 'happiness')
  2. Program queries the website thinkexist using this format ("http://thinkexist.com/search/searchQuotation.asp?search=USERINPUT")
  3. Program returns the first quote from the website.
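
Steps 1 and 2 can be sketched roughly as follows. This is only an illustration: the helper name `build_search_url` is hypothetical, and it assumes the site accepts a standard URL-encoded `search` parameter, which the format above suggests but does not guarantee.

```python
from urllib.parse import urlencode

# Base address taken from the URL format quoted in step 2.
BASE_URL = "http://thinkexist.com/search/searchQuotation.asp"


def build_search_url(term):
    """Return the search URL for a user-supplied term (hypothetical helper).

    urlencode escapes spaces and special characters so the term is safe
    to place in a query string.
    """
    return BASE_URL + "?" + urlencode({"search": term})


print(build_search_url("happiness"))
```

In a real program the argument would come from `input()`, and the resulting URL would be fetched before any parsing takes place.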

I've tried using XPath with lxml, but I have no experience with it, and every expression I construct comes back as an empty list.

The actual text of the quote appears to be contained in an element with the class "sqq".

If I navigate the site via Firebug and click the DOM tab, it appears the quote is in a text node's "wholeText" or "textContent" attribute, but I don't know how to use that knowledge programmatically.
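
For reference, lxml's closest counterpart to the DOM "textContent" property is the `text_content()` method on elements parsed with `lxml.html`. The sketch below is a minimal illustration under stated assumptions: the `SAMPLE` markup is invented for the example (the real page structure may differ), and `first_quote` is a hypothetical helper name. A relative `//` query on the class is used because absolute paths copied from a browser inspector often return nothing, since the browser shows the DOM after it has repaired the markup.

```python
from lxml import html

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE = """
<html><body>
  <a class="sqq" href="/q/1">Happiness depends upon <b>ourselves</b>.</a>
  <a class="sqq" href="/q/2">A second quote.</a>
</body></html>
"""


def first_quote(page_source):
    """Return the text of the first element with class "sqq", or None."""
    tree = html.fromstring(page_source)
    # Pad the class attribute with spaces so "sqq" matches as a whole
    # token, even when the element carries other classes as well.
    matches = tree.xpath(
        '//*[contains(concat(" ", normalize-space(@class), " "), " sqq ")]'
    )
    if not matches:
        return None
    # text_content() joins the text of the element and all its descendants.
    return matches[0].text_content().strip()


print(first_quote(SAMPLE))
```

Returning `None` when nothing matches makes it easy to tell an XPath that selects nothing apart from a page that really has no quotes.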

Any ideas?


Solution

  • If you don't need to implement this via XPath, you can use the BeautifulSoup library like this (assume the variable myXml holds the page's HTML source):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(myXml, 'html.parser')
    # Find every <a> element whose class is "sqq"
    for a in soup.find_all('a', {'class': 'sqq'}):
      # this is your quote
      print(a.get_text())
    

    In any case, read the BeautifulSoup documentation; it can be very useful for scraping tasks that don't require the power of XPath.