python, python-3.x, web-scraping, web-crawler

How can I download PDFs using an AI web crawler? (Crawl4AI)


I have been using Crawl4AI to try to download a series of documents from this website. However, since the page requires JavaScript and I am using Python, I don't know how to resolve my error.

The code is straight from the docs, with changes to the js_code variable:

from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai import AsyncWebCrawler
import os, asyncio
from pathlib import Path

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                // Collect every "Download" link on the page
                const downloadLinks = document.querySelectorAll("a[title='Download']");
                for (const link of downloadLinks) {
                  link.click();              
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Usage
download_path = os.path.join(Path.cwd(), "Downloads")
os.makedirs(download_path, exist_ok=True)

asyncio.run(download_multiple_files("https://data.humdata.org/dataset/repository-for-pdf-files", download_path))

Error:

× Unexpected error in _crawl_web at line 1551 in _crawl_web (.venv\Lib\site-packages\crawl4ai\async_crawler_strategy.py):
  Error: Wait condition failed: 'int' object has no attribute 'strip'

  Code context:
  1546                try:
  1547                    await self.smart_wait(
  1548                        page, config.wait_for, timeout=config.page_timeout
  1549                    )
  1550                except Exception as e:
  1551 →                  raise RuntimeError(f"Wait condition failed: {str(e)}")
  1552
  1553            # Update image dimensions if needed
  1554            if not self.browser_config.text_mode:
  1555                update_image_dimensions_js = load_js_script("update_image_dimensions")
  1556                try:

I was expecting to be able to download all the documents from that dataset. By the way, scraping with this crawler appears to be permitted by the site's robots.txt file:

User-agent: *
Disallow: /dataset/rate/
Disallow: /revision/
Disallow: /dataset/*/history
Disallow: /api/
Disallow: /user/*/api-tokens
Crawl-Delay: 10
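
As a quick sanity check, Python's standard-library urllib.robotparser can evaluate these rules directly (the URLs below are just the ones from this question; in a real crawler you would fetch https://data.humdata.org/robots.txt with set_url() and read() instead of pasting the rules inline):

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, pasted verbatim for an offline check
robots_txt = """\
User-agent: *
Disallow: /dataset/rate/
Disallow: /revision/
Disallow: /dataset/*/history
Disallow: /api/
Disallow: /user/*/api-tokens
Crawl-Delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The dataset page itself is allowed; the API is not
print(rp.can_fetch("*", "https://data.humdata.org/dataset/repository-for-pdf-files"))
print(rp.can_fetch("*", "https://data.humdata.org/api/3/action/package_show"))
print(rp.crawl_delay("*"))  # 10 -- wait at least 10 seconds between requests
```

Note that the Crawl-Delay of 10 seconds applies per request, so a polite crawler should pause between page fetches accordingly.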


Solution

  • The error occurs because the wait_for config parameter is being interpreted incorrectly: as the traceback shows, it must be a string, not an integer. According to the Crawl4AI docs, wait_for should be a CSS selector (prefixed with css:) or a JavaScript expression (prefixed with js:). So change wait_for=10 to wait_for="css:#dataset-resources" (which appears to be the container for the dataset download links).
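
The traceback message ('int' object has no attribute 'strip') comes from the library calling .strip() on the wait_for value, which only works on strings. A minimal sketch of that kind of dispatch (an illustration of the failure mode, not Crawl4AI's actual source) makes the fix obvious:

```python
def classify_wait_for(wait_for):
    """Illustrative sketch of how a string-only wait_for might be dispatched.

    This is NOT crawl4ai's real implementation -- it only shows why passing
    an integer raises AttributeError: 'int' object has no attribute 'strip'.
    """
    condition = wait_for.strip()  # fails immediately if wait_for is an int
    if condition.startswith("css:"):
        return ("css", condition[len("css:"):])
    if condition.startswith("js:"):
        return ("js", condition[len("js:"):])
    return ("css", condition)  # bare strings are treated as CSS selectors

# classify_wait_for(10)                        -> AttributeError
# classify_wait_for("css:#dataset-resources")  -> ("css", "#dataset-resources")
```

If you still want a time-based pause after clicking the links, put it inside js_code itself rather than in wait_for, or use a js: condition such as wait_for="js:() => document.querySelectorAll(\"a[title='Download']\").length > 0" (the exact selector is an assumption based on the question's script).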