I have been using Crawl4AI to try to download a series of documents from this website. However, since the page requires JavaScript and I am working in Python, I don't know how to solve the error I'm getting.
Here is my code, taken straight from the docs with changes only to the js_code variable:
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai import AsyncWebCrawler
import os, asyncio
from pathlib import Path
async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
            const downloadLinks = document.querySelectorAll(
                "a[title='Download']",
                document,
                null,
                XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
                null
            );
            for (const link of downloadLinks) {
                link.click();
            }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)
        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")
# Usage
download_path = os.path.join(Path.cwd(), "Downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://data.humdata.org/dataset/repository-for-pdf-files", download_path))
Error:
× Unexpected error in _crawl_web at line 1551 in _crawl_web
  (.venv\Lib\site-packages\crawl4ai\async_crawler_strategy.py):
  Error: Wait condition failed: 'int' object has no attribute 'strip'

  Code context:
  1546        try:
  1547            await self.smart_wait(
  1548                page, config.wait_for, timeout=config.page_timeout
  1549            )
  1550        except Exception as e:
  1551 →          raise RuntimeError(f"Wait condition failed: {str(e)}")
  1552
  1553        # Update image dimensions if needed
  1554        if not self.browser_config.text_mode:
  1555            update_image_dimensions_js = load_js_script("update_image_dimensions")
  1556            try:
I was expecting to be able to download all documents from that dataset. By the way, web scraping with this crawler appears to be permitted according to the site's robots.txt file:
User-agent: *
Disallow: /dataset/rate/
Disallow: /revision/
Disallow: /dataset/*/history
Disallow: /api/
Disallow: /user/*/api-tokens
Crawl-Delay: 10
The error occurs because the wait_for config parameter is being interpreted incorrectly: the crawler's smart_wait expects a string, and calling .strip() on the integer 10 raises the exception you see. According to the Crawl4AI docs, wait_for should be a CSS selector (prefixed with css:) or a JavaScript expression (prefixed with js:), not a number of seconds. You can modify your script to use wait_for="css:#dataset-resources" (which seems to be the container for the dataset download links).
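As a minimal sketch, the run config could look like the following. The "#dataset-resources" selector is an assumption about the HDX page layout; verify in your browser's dev tools which element actually appears once the download links are rendered, and adjust the selector accordingly.

# Sketch of a corrected run config, assuming "#dataset-resources" is the
# container that appears when the download links have loaded.
run_config = CrawlerRunConfig(
    js_code="""
    const downloadLinks = document.querySelectorAll("a[title='Download']");
    for (const link of downloadLinks) {
        link.click();
    }
    """,
    # wait_for takes a string: "css:<selector>" or "js:<expression>",
    # not an integer number of seconds.
    wait_for="css:#dataset-resources",
)

With wait_for passed as a string, smart_wait no longer fails, and the crawler waits for that element before collecting result.downloaded_files.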