I am parsing a news website. The starting point (the main-url) is a page that lists news items. I want to extract each news sub-url from the list and request those sub-urls to get their HTML. Then, if the main-url has a next page, I request that next page as well. Here is my code:
# These methods live inside the spider class; json and scrapy are imported at module level.
def parse(self, response):
    res_json = json.loads(response.text)
    if res_json['code'] == 200:
        page_info = res_json['data']  # assuming the page payload sits under 'data'; adjust to the actual API response
        news_list = page_info['list']
        next_page_num = page_info['nextPageNum']
        has_next = page_info['hasNext']
        for news in news_list:
            news_url = news['shareUrl']  # each news entry carries its own shareUrl
            item = {'url': news_url}     # fresh item per news entry
            yield response.follow(news_url,
                                  headers=self.headers,
                                  callback=self.parse_news_html,
                                  cb_kwargs=dict(item=item))
        if has_next:
            self.body = '"pageNum":{}'.format(next_page_num)  # body format mirrors the original POST request
            url = 'https://api.abcd.com/www/newsList/channelNewsList'
            yield scrapy.Request(url=url,
                                 method='POST',
                                 body=self.body,
                                 callback=self.parse)

def parse_news_html(self, response, item):
    item['page_html'] = response.text
    return item
However, with this code the pipeline never receives the item, and the main-url never advances to the next page. As far as I know, a parse method can return an item and/or an iterable of Requests, but it cannot return two separate iterables of Requests. How should I organize my code so that I can both send the item to the pipeline and request the next page?
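To make the question concrete, here is a stripped-down sketch of the behaviour I am after (the spider name, URL, and selectors are placeholders, not my real project): one parse callback yielding both items and follow-up Requests from the same generator, so Scrapy can route the items to the pipeline and the Requests back to the scheduler.

import scrapy


class SketchSpider(scrapy.Spider):
    name = 'sketch'
    start_urls = ['https://example.com/news?page=1']

    def parse(self, response):
        # yield one item per listed news entry -- these should reach the pipeline
        for href in response.css('a.news::attr(href)').getall():
            yield {'url': response.urljoin(href)}
        # and also yield a Request for the next listing page -- this goes back to the scheduler
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)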
Actually, using requests.get() to fetch each sub-url and then returning the item does resolve the problem, but I would rather not depend on the requests library; I think using Scrapy's own Request is the more "formal" way to solve this. Is that correct?
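For completeness, the workaround I mentioned looks roughly like this (a sketch under the same 'data' assumption as above, not my exact code); the sub-url is fetched synchronously, so it bypasses Scrapy's downloader and no second callback is needed.

# Sketch of the requests.get() workaround (method inside the spider; json and requests imported at module level)
def parse(self, response):
    page_info = json.loads(response.text)['data']  # same placeholder key as above
    for news in page_info['list']:
        news_url = news['shareUrl']
        # blocking fetch outside Scrapy's downloader and middlewares
        page = requests.get(news_url, headers=self.headers)
        yield {'url': news_url, 'page_html': page.text}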
Appreciate the help from wRAR and Alexander. It turned out I hadn't configured my pipeline in settings.py. Thanks for your time. Best, KS
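For anyone who hits the same symptom: the fix was simply enabling the pipeline in settings.py (the class path and priority below are placeholders for your own pipeline):

# settings.py -- replace the class path with your own pipeline; 300 is just a priority between 0 and 1000
ITEM_PIPELINES = {
    'myproject.pipelines.NewsPipeline': 300,
}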