I am using Splash with Scrapy to load dynamically rendered content on a page, but it does not work as I expected.
In settings.py I set these variables:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}
SPLASH_URL = "http://localhost:8050"
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
SPLASH_COOKIES_DEBUG = False
The spider:
def start_requests(self):
    urls = [
        "https://callmeduy.com/san-pham/",
    ]
    for url in urls:
        yield SplashRequest(
            url=url,
            # endpoint='render.html',
            callback=self.parse,
            args={'wait': 5},
        )

def parse(self, response):
    print(response.xpath("//body").get())
    with open('res.html', 'w') as f:
        f.write(response.xpath("//body").get())
The dynamic content has not been loaded; the response body saved to res.html does not contain it. Can anyone help?
I couldn't get this to work with Splash, probably because I'm not very familiar with it. However, I have a working solution that uses Scrapy and Playwright.
This is requirements.txt:
Scrapy==2.11.2
playwright==1.44.0
scrapy-playwright==0.0.35
beautifulsoup4==4.12.3
In settings.py:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000  # 30 seconds, in milliseconds
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,
}
# Note: scrapy-playwright also requires the asyncio Twisted reactor. Projects
# generated with a recent `scrapy startproject` already set
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor";
# add it here if your project does not.
And the spider:
import asyncio

import scrapy
from bs4 import BeautifulSoup


class CallmeduySpider(scrapy.Spider):
    name = "callmeduy"
    allowed_domains = ["callmeduy.com"]

    def start_requests(self):
        url = "https://callmeduy.com/san-pham"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            while True:
                # "lxml" needs the lxml package installed; "html.parser" also works.
                soup = BeautifulSoup(await page.content(), "lxml")
                wait = soup.select_one(".card-title.h5 > span span.react-loading-skeleton")
                if not wait:
                    self.logger.debug("====================================================")
                    for card in soup.select(".jss23 .row .col-12"):
                        link = card.select_one("a.jss29")
                        title = card.select_one(".card-title.h5 > span.jss31")
                        self.logger.debug(title.get_text())
                        self.logger.debug(link["href"])
                        # TODO: Probably yield another scrapy.Request() here for each product?
                    self.logger.debug("====================================================")
                    return
                else:
                    self.logger.info("Skeleton still present, waiting for content to load.")
                    # Sleep asynchronously so the event loop is not blocked.
                    await asyncio.sleep(5)
        finally:
            # Release the Playwright page once we are done with it.
            await page.close()
The key thing was to ensure that the content of the page is fully rendered. This is a rather brute-force way to do it (and could probably be done a lot more elegantly!), but it's a pragmatic solution.
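If you want something less brute-force, scrapy-playwright can also run Playwright page methods for you before the response ever reaches the callback, which removes the polling loop and playwright_include_page entirely. Here is a minimal sketch of that idea; it reuses the same (JSS-generated) selectors as above, so they may need adjusting, and I have not battle-tested this variant against the live site:

import scrapy
from scrapy_playwright.page import PageMethod


class CallmeduyWaitSpider(scrapy.Spider):
    name = "callmeduy_wait"
    allowed_domains = ["callmeduy.com"]

    def start_requests(self):
        yield scrapy.Request(
            "https://callmeduy.com/san-pham",
            meta={
                "playwright": True,
                # Wait until no loading skeleton is left in the DOM before the
                # response is handed to parse(). This mirrors the check the
                # polling loop above does by hand.
                "playwright_page_methods": [
                    PageMethod(
                        "wait_for_selector",
                        ".card-title.h5 > span span.react-loading-skeleton",
                        state="detached",
                        timeout=30 * 1000,
                    ),
                ],
            },
        )

    def parse(self, response):
        # The response body is the fully rendered HTML at this point.
        for card in response.css(".jss23 .row .col-12"):
            yield {
                "title": card.css(".card-title.h5 > span.jss31::text").get(),
                "link": card.css("a.jss29::attr(href)").get(),
            }

PageMethod("wait_for_selector", ..., state="detached") resolves once the skeleton element is no longer present in the DOM, so parse() only runs on rendered content.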
Sample output:
2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================
2024-06-17 09:25:41 [callmeduy] DEBUG: Sữa Chống Nắng Bí Đao Cocoon
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1451
2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Hyaluronic Acid 3...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1450
2024-06-17 09:25:41 [callmeduy] DEBUG: Kem chống nắng Skin1004 Mad...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1449
2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Niacinamide 15% S...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1448
2024-06-17 09:25:41 [callmeduy] DEBUG: Nacific Origin Red Salicyli...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1446
2024-06-17 09:25:41 [callmeduy] DEBUG: Skin1004 Madagascar Centell...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1445
2024-06-17 09:25:41 [callmeduy] DEBUG: ACNACARE GEL Mega We Care
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1444
2024-06-17 09:25:41 [callmeduy] DEBUG: Viên uống ACNACARE Mega We ...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1443
2024-06-17 09:25:41 [callmeduy] DEBUG: Serum NNO VITE Mega We Care
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1442
2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydro Boost Acti...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1441
2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydroboost Clean...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1440
2024-06-17 09:25:41 [callmeduy] DEBUG: Skin Recovery Cream
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1439
2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================
As noted in a comment in the code, you will probably want to follow those links to the actual product pages (or perhaps not?). You'll also need to handle the pagination of the results on the index page.
However, this code should get you started with something that at least yields results.
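For completeness, here is one possible shape for those two follow-ups. Treat it as a hedged sketch: the "next page" selector, the parse_product fields, and the assumption that the detail pages render fine with a plain Playwright request are all guesses that need checking against the real site.

import scrapy
from bs4 import BeautifulSoup
from scrapy_playwright.page import PageMethod


class CallmeduyFollowSpider(scrapy.Spider):
    name = "callmeduy_follow"
    allowed_domains = ["callmeduy.com"]

    # Selector for the React loading skeletons, taken from the answer above.
    SKELETON = ".card-title.h5 > span span.react-loading-skeleton"

    def _index_request(self, url):
        # Index pages need Playwright plus a wait for the skeletons to vanish.
        return scrapy.Request(
            url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", self.SKELETON, state="detached"),
                ],
            },
            callback=self.parse,
        )

    def start_requests(self):
        yield self._index_request("https://callmeduy.com/san-pham")

    def parse(self, response):
        soup = BeautifulSoup(response.text, "html.parser")

        # Follow every product card to its detail page.
        for card in soup.select(".jss23 .row .col-12"):
            link = card.select_one("a.jss29")
            if link and link.get("href"):
                yield scrapy.Request(
                    response.urljoin(link["href"]),
                    # The detail pages are probably React-rendered too, so keep
                    # Playwright on; they may need their own wait as well.
                    meta={"playwright": True},
                    callback=self.parse_product,
                )

        # Pagination: this "next page" selector is a placeholder guess.
        next_page = soup.select_one("a[aria-label='Next page']")
        if next_page and next_page.get("href"):
            yield self._index_request(response.urljoin(next_page["href"]))

    def parse_product(self, response):
        # Placeholder extraction; replace with the product fields you need.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }

The small _index_request() helper keeps the skeleton wait in one place, so paginated index pages get the same treatment as the first one.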