python, unit-testing, scrapy, nose

Scrapy Unit Testing


I'd like to implement some unit tests in a Scrapy project (a screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, can I run it through something like nose? Since Scrapy is built on top of Twisted, can I use its unit testing framework, Trial? If so, how? Otherwise I'd like to get nose working.

Update:

I've been talking on Scrapy-Users, and I guess I am supposed to "build the Response in the test code, and then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work, though.

I can build a unit test class, and in a test:

However, it ends up generating this traceback. Any insight as to why?


Solution

  • The way I've done it is to create fake responses; this way you can test the parse function offline, but you still exercise the real situation by using real HTML.

    A problem with this approach is that your local HTML file may not reflect the latest state online. If the HTML changes online you may have a serious bug, but your test cases will still pass, so this may not be the best way to test.

    My current workflow is: whenever there is an error, I send an email to the admin with the URL. Then, for that specific error, I create an HTML file with the content that caused it and write a unit test against that file. A sketch of the notification step follows.

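    For the email step, here is a minimal sketch of what I mean, assuming Scrapy's built-in scrapy.mail.MailSender; the ADMIN_EMAIL address, the helper name, and the spider integration are illustrative assumptions, not a fixed API:

    from scrapy.mail import MailSender
    
    ADMIN_EMAIL = 'admin@example.com'  # hypothetical address, for illustration
    
    def notify_admin_of_parse_error(spider, response, error):
        """Email the failing URL to the admin so a regression test
        (and a sample HTML file) can be built from it."""
        mailer = MailSender.from_settings(spider.settings)
        mailer.send(to=[ADMIN_EMAIL],
                    subject='Parse error in spider %s' % spider.name,
                    body='Failed to parse %s: %r' % (response.url, error))
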
    This is the code I use to create sample Scrapy HTTP responses for testing, from a local HTML file:

    # scrapyproject/tests/responses/__init__.py
    
    import os
    
    from scrapy.http import HtmlResponse, Request
    
    def fake_response_from_file(file_name, url=None):
        """
        Create a fake Scrapy HTTP response from an HTML file.
    
        @param file_name: The relative filename from the responses directory,
                          but absolute paths are also accepted.
        @param url: The URL of the response.
        returns: A Scrapy HTTP response which can be used for unit testing.
        """
        if not url:
            url = 'http://www.example.com'
    
        request = Request(url=url)
        if not os.path.isabs(file_name):
            responses_dir = os.path.dirname(os.path.realpath(__file__))
            file_path = os.path.join(responses_dir, file_name)
        else:
            file_path = file_name
    
        with open(file_path, 'r') as f:
            file_content = f.read()
    
        # Use HtmlResponse (not the base Response) so the body is decoded
        # and the selector API works; the encoding must be passed to the
        # constructor because response.encoding is a read-only property.
        response = HtmlResponse(url=url,
                                request=request,
                                body=file_content,
                                encoding='utf-8')
        return response
    

    The sample HTML file is located at scrapyproject/tests/responses/osdir/sample.html.

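    As a quick sanity check of the helper, a hypothetical interactive session (the URL is the default filled in by the function above):

    >>> from scrapyproject.tests.responses import fake_response_from_file
    >>> response = fake_response_from_file('osdir/sample.html')
    >>> response.url
    'http://www.example.com'
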
    The test case, located at scrapyproject/tests/test_osdir.py, could then look as follows:

    import unittest
    
    from scrapyproject.spiders import osdir_spider
    from scrapyproject.tests.responses import fake_response_from_file
    
    class OsdirSpiderTest(unittest.TestCase):
    
        def setUp(self):
            self.spider = osdir_spider.DirectorySpider()
    
        def _test_item_results(self, results, expected_length):
            # parse() returns an iterable of items; count them while
            # checking that the required fields were extracted.
            count = 0
            for item in results:
                self.assertIsNotNone(item['content'])
                self.assertIsNotNone(item['title'])
                count += 1
            self.assertEqual(count, expected_length)
    
        def test_parse(self):
            results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
            self._test_item_results(results, 10)
    
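    Since this is a plain unittest.TestCase, a standard runner should pick it up without any Twisted/Trial machinery; for example, "nosetests scrapyproject/tests" or "python -m unittest discover" ought to work, depending on how your project is laid out.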

    That's basically how I test my parsing methods, though it's not limited to parsing methods. If it gets more complex, I suggest looking at Mox; a sketch of what that might look like follows.
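
    To give a rough idea of where Mox fits, here is a minimal sketch; PriceClient and its fetch_price method are hypothetical stand-ins for whatever networked dependency your code calls:

    import unittest
    
    import mox
    
    class PriceClient(object):
        """Hypothetical dependency that would normally hit the network."""
        def fetch_price(self, url):
            raise NotImplementedError
    
    class MoxSketchTest(unittest.TestCase):
    
        def setUp(self):
            self.mox = mox.Mox()
    
        def tearDown(self):
            self.mox.UnsetStubs()
    
        def test_fetch_price_is_stubbed_out(self):
            client = PriceClient()
            # Record phase: declare the expected call and its canned result.
            self.mox.StubOutWithMock(client, 'fetch_price')
            client.fetch_price('http://www.example.com').AndReturn(42)
            self.mox.ReplayAll()
            # Replay phase: the stub returns the canned value, offline.
            self.assertEqual(client.fetch_price('http://www.example.com'), 42)
            self.mox.VerifyAll()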