Tags: scrapy, scrapy-shell

How should I organize my code if I want to parse sub-URLs from the page requested in start_requests(), and also request the next page?


I am parsing a news website. The starting point (the main-url) is a page with a list of news. I want to parse out the sub-url of each news item and request those sub-urls to get the corresponding HTML. Then, if the main-url still has a next page, I request the next page. Here is my code:

    # (These methods live inside a scrapy.Spider subclass and need
    # `import json` and `import scrapy` at module level.)
    def parse(self, response):
        res_json = json.loads(response.text)
        if res_json['code'] == 200:
            # Assumption: the paging payload sits under a key such as 'data'.
            page_info = res_json['data']
            news_list = page_info['list']
            next_page_num = page_info['nextPageNum']
            has_next = page_info['hasNext']
            for news in news_list:
                news_url = news['shareUrl']  # shareUrl belongs to each news entry
                item = {'url': news_url}     # fresh item for each news entry
                yield response.follow(news_url, headers=self.headers,
                                      callback=self.parse_news_html,
                                      cb_kwargs=dict(item=item))

            if has_next:
                # Assumption: the endpoint expects a JSON body; a local variable
                # avoids mutating spider state across concurrent requests.
                body = json.dumps({'pageNum': next_page_num})
                url = 'https://api.abcd.com/www/newsList/channelNewsList'
                yield scrapy.Request(url=url,
                                     method='POST',
                                     body=body,
                                     callback=self.parse)

    def parse_news_html(self, response, item):
        item['page_html'] = response.text
        return item

However, written this way, the pipeline never receives the item, and the spider never moves on to the next page of the main-url. As far as I know, a parse method can return an item and/or an iterable of Requests, but not two separate iterables of Requests. How should I organize my code so that I can both deliver the item to the pipeline and request the next page?

Actually, using requests.get() to fetch the sub-urls and then returning the item does work around the problem, but I would rather not depend on the requests library: using Scrapy's own Request to solve the problem seems more "formal". Is that correct?
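
For reference, a single generator callback in Scrapy may freely mix yielded items and yielded Requests, so there is no need for an external HTTP library. A minimal self-contained sketch modeled on the quotes.toscrape.com site from the Scrapy tutorial (the site and CSS selectors are illustrative, not from the question above):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['https://quotes.toscrape.com/']

        def parse(self, response):
            # Yielding a dict sends it to the item pipelines...
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').get()}

            # ...and yielding a Request from the same generator schedules
            # another page, all within one callback.
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)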


Solution

  • Appreciate the help from wRAR and Alexander. It turned out that I simply hadn't configured my pipeline in settings.py. Thanks for your time. Best, KS
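
For anyone hitting the same symptom: yielded items only reach a pipeline after it is enabled in settings.py via ITEM_PIPELINES. A minimal sketch, where the module path and class name are placeholders for your project's own:

    # settings.py
    # Map each pipeline class to an order value (0-1000); lower runs first.
    ITEM_PIPELINES = {
        'myproject.pipelines.NewsHtmlPipeline': 300,  # placeholder path
    }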