
Scrapy Spider error processing correct link


The start_url in this spider seems to be causing a problem, but I am unsure why. Here is the project breakdown.

import scrapy
from statements.items import StatementsItem


class IncomeannualSpider(scrapy.Spider):
    name = 'incomeannual'
    start_urls = ['https://www.marketwatch.com/investing/stock/A/financials']

    def parse(self, response):
        item = {}

        item['ticker'] = response.xpath("//h1[contains(@id, 'instrumentname')]//text()").extract()
        item['sales2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[0]
        item['sales2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[1]
        item['sales2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[2]
        item['sales2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[3]
        item['sales2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[4]
        item['sales2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[0]
        item['sales2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[1]
        item['sales2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[2]
        item['sales2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[3]
        item['sales2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[4]
        item['cogs2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[0]
        item['cogs2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[1]
        item['cogs2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[2]
        item['cogs2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[3]
        item['cogs2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[4]
        item['cogs2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[0]
        item['cogs2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[1]
        item['cogs2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[2]
        item['cogs2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[3]
        item['cogs2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[4]
        item['pretaxincome2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[0]
        item['pretaxincome2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[1]
        item['pretaxincome2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[2]
        item['pretaxincome2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[3]
        item['pretaxincome2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[4]
        item['pretaxincome2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[0]
        item['pretaxincome2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[1]
        item['pretaxincome2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[2]
        item['pretaxincome2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[3]
        item['pretaxincome2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[4]
        item['netincome2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[0]
        item['netincome2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[1]
        item['netincome2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[2]
        item['netincome2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[3]
        item['netincome2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[4]
        item['netincome2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[0]
        item['netincome2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[1]
        item['netincome2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[2]
        item['netincome2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[3]
        item['netincome2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[4]
        item['eps2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[0]
        item['eps2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[1]
        item['eps2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[2]
        item['eps2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[3]
        item['eps2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[4]
        item['eps2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[0]
        item['eps2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[1]
        item['eps2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[2]
        item['eps2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[3]
        item['eps2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[4]
        item['eps2014altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[0]
        item['eps2015altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[1]
        item['eps2016altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[2]
        item['eps2017altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[3]
        item['eps2018altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[4]
        item['ebitda2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[0]
        item['ebitda2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[1]
        item['ebitda2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[2]
        item['ebitda2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[3]
        item['ebitda2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[4]
        item['ebitda2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[0]
        item['ebitda2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[1]
        item['ebitda2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[2]
        item['ebitda2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[3]
        item['ebitda2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[4]

        yield item

All of the XPaths were checked against the start_url in the shell and seemed to be working just fine.
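
For example, one of the checks in the scrapy shell looked like this (illustrative; the returned values are omitted here):

    scrapy shell "https://www.marketwatch.com/investing/stock/A/financials"
    >>> response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()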

2019-03-17 10:25:06 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: statements)
2019-03-17 10:25:06 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-03-17 10:25:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'statements', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_FORMAT': 'csv', 'FEED_URI': 'sdasda.csv', 'NEWSPIDER_MODULE': 'statements.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['statements.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
2019-03-17 10:25:06 [scrapy.extensions.telnet] INFO: Telnet Password: 3580241d541f00bb
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'statements.middlewares.StatementsDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled item pipelines:
['statements.pipelines.StatementsPipeline']
2019-03-17 10:25:06 [scrapy.core.engine] INFO: Spider opened
2019-03-17 10:25:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-17 10:25:06 [incomeannual] INFO: Spider opened: incomeannual
2019-03-17 10:25:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2019-03-17 10:25:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marketwatch.com/robots.txt> (referer: None)
2019-03-17 10:25:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marketwatch.com/investing/stock/A/financials> (referer: None)
2019-03-17 10:25:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.marketwatch.com/investing/stock/A/financials> (referer: None)
Traceback (most recent call last):
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\utils\defer.py", line 102, in iter_errback
        yield next(it)
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\offsite.py", line 29, in 
process_spider_output
    for x in result:
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Jesse\Files\Financial\statements\statements\spiders\incomeannual.py", 
line 64, in parse
    item['eps2014altrate'] = response.xpath("//td[./preceding- 
sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[0]
IndexError: list index out of range
2019-03-17 10:25:07 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-17 10:25:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 636,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 25693,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 3, 17, 14, 25, 7, 786531),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2019, 3, 17, 14, 25, 6, 856319)}
2019-03-17 10:25:07 [scrapy.core.engine] INFO: Spider closed (finished)

This site requires the USER_AGENT setting to be set before it allows scraping. I've tried specifying headers in settings.py, but this spider will eventually run with over 5,000 start_urls, and I'm not sure how to combine that setting with multiple URLs. I've used this setup in multiple other projects and they work fine.
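
For reference, the user agent is currently set globally in settings.py; this snippet just mirrors the USER_AGENT value visible in the 'Overridden settings' log entry above:

    # settings.py -- same value as in the 'Overridden settings' log entry
    USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36')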

Any advice will be very much appreciated! Thanks!


Solution

  • The error in your log occurs because that specific XPath matches nothing (tested in the scrapy shell):

    >>> response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()
    []
    >>> response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[0]
    Traceback (most recent call last):
      File "<console>", line 1, in <module>
    IndexError: list index out of range
    

    You need to check the length of the extracted list before indexing into it, because it is not safe to assume that the index exists. There are various shorthand solutions here: Get value at list/array index or "None" if out of range in Python

    Here is one example:

    values = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()
    item['eps2014altrate'] = values[0] if len(values) > 0 else None
    item['eps2015altrate'] = values[1] if len(values) > 1 else None
    item['eps2016altrate'] = values[2] if len(values) > 2 else None
    item['eps2017altrate'] = values[3] if len(values) > 3 else None
    item['eps2018altrate'] = values[4] if len(values) > 4 else None
    

    You can make it a bit less verbose by writing a helper function, sketched below. Either way, you should apply this pattern everywhere, not just to the failing XPath.
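
    A minimal sketch (safe_get and the year loop are illustrative, not part of the original spider):

    def safe_get(values, index, default=None):
        """Return values[index], or default when the index is out of range."""
        try:
            return values[index]
        except IndexError:
            return default

    values = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()
    for offset, year in enumerate(range(2014, 2019)):
        item['eps{}altrate'.format(year)] = safe_get(values, offset)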