We want to scrape articles (content + headline) to extend our dataset for text classification purposes.
GOAL: scrape all articles from all pages at https://www.bbc.com/news/technology
PROBLEM: It seems that the code only extracts the articles from https://www.bbc.com/news/technology?page=1, even though we follow all pages. Could there be a problem in how we follow the pages?
import scrapy
from typing import Any

from scrapy.http import Response


class BBCSpider_2(scrapy.Spider):
    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        # Read the highest page number from the pagination widget.
        max_pages = response.xpath("//nav[@aria-label='Page']/div/div/div/div/ol/li[last()]/div/a//text()").get()
        max_pages = int(max_pages)
        # Follow every page of the topic stream.
        for p in range(max_pages):
            page = f"https://www.bbc.com/news/technology?page={p+1}"
            yield response.follow(page, callback=self.parse_articles2)
Next, we go into each article on the corresponding page:
    def parse_articles2(self, response):
        # The page keeps its article links in two containers (div 4 and div 8).
        container_to_scan = [4, 8]
        for box in container_to_scan:
            if box == 4:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div/div/ul/li")
            if box == 8:
                articles = response.xpath(f"//*[@id='main-content']/div[{box}]/div[2]/ol/li")
            for article_idx in range(len(articles)):
                if box == 4:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[4]/div/div/ul/li[{article_idx + 1}]/div/div/div/div[1]/div[1]/a/@href").get()
                elif box == 8:
                    relative_url = response.xpath(f"//*[@id='main-content']/div[8]/div[2]/ol/li[{article_idx + 1}]/div/div/div[1]/div[1]/a/@href").get()
                else:
                    relative_url = None
                if relative_url is not None:
                    followup_url = "https://www.bbc.com" + relative_url
                    yield response.follow(followup_url, callback=self.parse_article)
Last but not least, we scrape the content and title of each article:
    def parse_article(response):
        article_text = response.xpath("//article/div[@data-component='text-block']")
        content = []
        for box in article_text:
            text = box.css("div p::text").get()
            if text is not None:
                content.append(text)
        title = response.css("h1::text").get()
        yield {
            "title": title,
            "content": content,
        }
When we run this, we get an items_scraped_count of 24, but it should be roughly 24 x 29 +/- ...
It appears that your subsequent calls to page 2, page 3, and so on are being filtered by Scrapy's duplicate-filtering functionality. The reason that happens is that the site keeps serving the same front page no matter what page number you put into the URL query. After rendering the front page, it uses a JSON API to fetch the actual article information for the requested page, which Scrapy alone can't capture unless you call the API directly.
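If you want to see the filtering happen, Scrapy's DUPEFILTER_DEBUG setting makes the duplicate filter log every request it drops instead of only the first one. Here is a minimal sketch; the spider below is a hypothetical one-off just for checking, not part of the fix:

import scrapy

class DupeDebugSpider(scrapy.Spider):
    name = "dupe_debug"
    start_urls = ["https://www.bbc.com/news/technology"]
    # Log every request dropped by the duplicate filter, not just the first.
    custom_settings = {"DUPEFILTER_DEBUG": True}

    def parse(self, response):
        for p in range(1, 4):
            # If the site collapses every ?page=N onto the same front page,
            # the colliding requests show up as "Filtered duplicate request"
            # lines in the log.
            yield response.follow(f"/news/technology?page={p}", callback=self.check)

    def check(self, response):
        self.logger.info("actually fetched: %s", response.url)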
The JSON API can be discovered in your browser's dev tools, in the network tab. You simply need to plug in the desired page number, similar to what you were already doing with the .../news/technology?page=? URL; I use it in the example below.
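As an aside, if the raw query string is hard to read, you could build it with urllib.parse.urlencode and only vary pageNumber. Treat this as a sketch: the parameter set is trimmed to the obviously relevant ones, and the endpoint may still require the remaining parameters from the captured request, so copy them over if it does:

from urllib.parse import urlencode

API_BASE = "https://www.bbc.com/wc-data/container/topic-stream"

def topic_stream_url(page: int, page_size: int = 24) -> str:
    # Only pageNumber changes between pages; the rest is copied from the
    # request captured in the network tab (trimmed here for readability).
    params = {
        "pageNumber": page,
        "pageSize": page_size,
        "showPagination": "true",
        "title": "Latest News",
        "urn": "urn:bbc:vivo:curation:b2790c4d-d5c4-489a-84dc-be0dcd3f5252",
    }
    return f"{API_BASE}?{urlencode(params)}"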
One other thing... your parse_article method is missing the self as the first parameter, which would throw an error and prevent you from actually scraping any of the page content. I also rewrote a couple of your XPaths to make them a bit more readable.
import scrapy


class BBCSpider_2(scrapy.Spider):
    name = "bbc_tech"
    start_urls = ["https://www.bbc.com/news/technology"]

    def parse(self, response):
        max_pages = response.xpath("//nav[@aria-label='Page']//ol/li[last()]//text()").get()
        # Page 1 articles are already rendered in the HTML, so grab them here.
        for article in response.xpath("//div[@type='article']"):
            if link := article.xpath(".//a[contains(@class, 'LinkPostLink')]/@href").get():
                yield response.follow(link, callback=self.parse_article)
        # Pages 2..max_pages come from the JSON api the site itself calls
        # (note the + 1 so the last page isn't skipped).
        for i in range(2, int(max_pages) + 1):
            yield scrapy.Request(f"https://www.bbc.com/wc-data/container/topic-stream?adSlotType=mpu_middle&enableDotcomAds=true&isUk=false&lazyLoadImages=true&pageNumber={i}&pageSize=24&promoAttributionsToSuppress=%5B%22%2Fnews%22%2C%22%2Fnews%2Ffront_page%22%5D&showPagination=true&title=Latest%20News&tracking=%7B%22groupName%22%3A%22Latest%20News%22%2C%22groupType%22%3A%22topic%20stream%22%2C%22groupResourceId%22%3A%22urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252%22%2C%22groupPosition%22%3A5%2C%22topicId%22%3A%22cd1qez2v2j2t%22%7D&urn=urn%3Abbc%3Avivo%3Acuration%3Ab2790c4d-d5c4-489a-84dc-be0dcd3f5252", callback=self.parse_json)

    def parse_json(self, response):
        # Each JSON page lists its articles under "posts".
        for post in response.json()["posts"]:
            yield scrapy.Request(response.urljoin(post["url"]), callback=self.parse_article)

    def parse_article(self, response):
        article_text = response.xpath("//article/div[@data-component='text-block']//text()").getall()
        content = " ".join([i.strip() for i in article_text])
        title = response.css("h1::text").get()
        yield {
            "title": title,
            "content": content,
        }
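If you want to try it without a full Scrapy project scaffold, a CrawlerProcess wrapper works; the output filename here is just an example, and running scrapy runspider spider.py -o articles.jsonl from the command line does the same thing:

from scrapy.crawler import CrawlerProcess

# Write each scraped item to a JSON-lines file via Scrapy's feed exports.
process = CrawlerProcess(settings={
    "FEEDS": {"articles.jsonl": {"format": "jsonlines"}},
})
process.crawl(BBCSpider_2)
process.start()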