As you have probably gathered from the title, I am using Scrapy with XPath to extract data. To keep the spider generic (so I don't have to edit it often), I provide the XPath expressions to the spider from a file, and I am able to extract the data in the required format.

Now I want to verify that each XPath expression is still valid against the web page the spider targets (if the page's HTML structure has changed, my XPath will no longer match). I want to run this check before the spider starts.

How do I test my XPaths' correctness? Is there any way to do this kind of truth testing? Please help.
    import json

    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["file:///<filepath>.html"]

        def __init__(self):
            super().__init__()
            self.mt = ""

        def parse(self, response):
            respDta = dict()
            it_lst = []
            dtData = response.selector.xpath(gx.spcPth[0])
            for ra in dtData:
                commodityObj = ra.xpath(gx.spcPth[1])
                extracted = commodityObj.extract()  # renamed from `list` to avoid shadowing the builtin
                cmdNme = extracted[0].replace(u'\xa0', u' ')
                cmdNme = cmdNme.replace("Header text: ", '')
                self.populate_item(response, respDta, cmdNme, it_lst, extracted[0])
            respDta["mt"] = self.mt
            jsonString = json.dumps(respDta, default=lambda o: o.__dict__)
            return jsonString
gx.spcPth comes from another function that provides the XPath expressions, and it is used in many places throughout the rest of the code. I need to check each XPath expression before the spider starts, wherever it is used.
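One thing you can check up front is whether each expression is syntactically valid XPath at all. A minimal sketch, assuming your expressions are available as a plain list of strings (here stood in for by the argument to a hypothetical validate_xpaths helper): lxml, which Scrapy uses under the hood, compiles an expression with etree.XPath and raises XPathSyntaxError if it is malformed, so you can run this before starting the spider.

```python
from lxml import etree


def validate_xpaths(expressions):
    """Return a list of (expression, error) pairs for syntactically invalid XPaths."""
    errors = []
    for expr in expressions:
        try:
            # Compiling the expression checks its syntax; no document is needed.
            etree.XPath(expr)
        except etree.XPathSyntaxError as exc:
            errors.append((expr, str(exc)))
    return errors


bad = validate_xpaths(["//div[@class='price']", "//div[unclosed"])
for expr, err in bad:
    print("invalid XPath %r: %s" % (expr, err))
```

Note this only catches syntax errors. It cannot tell you that the page layout changed; for that you have to actually run the expressions against the page (or a saved copy of it), which is what the answer below gets at.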
I understand what you are trying to do; I just don't see why. Running the spider is itself your testing process. It's as simple as this: if an XPath result is empty when it should have returned something, then something is wrong. Why not just check the XPath results and use Scrapy's logging to flag the failure as a warning, error, or critical message, whatever you want. Simple as this:
    from scrapy import log

    somedata = response.xpath(my_supper_dupper_xpath)
    # we know that this should have captured
    # something, so we check
    if not somedata:
        log.msg("This should never happen, XPaths are all wrong, OMG!", level=log.CRITICAL)
    else:
        # do your actual parsing of the captured data,
        # that we now know exists
        pass
After that, just run your spider and relax. When you see critical messages in your output log, you'll know it's time to shit bricks. Otherwise, everything is ok.
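One caveat: the scrapy.log module and log.msg shown above were deprecated in Scrapy 1.0 in favor of Python's standard logging module; inside a spider you would use self.logger instead. The same check can be wrapped in a small helper so you can apply it to every expression from your file. This is a sketch under assumptions: extract_or_warn and its arguments are my own names, and response can be anything with an .xpath() method returning an empty (falsy) result on no match, such as a Scrapy response or an lxml document.

```python
import logging

logger = logging.getLogger(__name__)


def extract_or_warn(response, xpath, name):
    """Run an XPath and log loudly if it matches nothing."""
    result = response.xpath(xpath)
    if not result:
        logger.critical(
            "XPath %r for %s matched nothing; the page layout may have changed",
            xpath, name,
        )
    return result
```

Inside a spider you would call it as extract_or_warn(response, gx.spcPth[0], "commodity table") and swap logger for self.logger; grepping the crawl log for CRITICAL then tells you which expressions went stale.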