python selenium-webdriver web-scraping scrapy playwright

How to Scrape a JavaScript-Rendered Table? (wait_for_selector Timeout & Data Not Loading)


I'm trying to scrape a table from a webpage, but the table is dynamically loaded via JavaScript and appears 5-7 seconds after page load when viewed manually.

However, when using a web scraper, the table does not load at all or times out. I’ve tried multiple approaches (Playwright, Selenium, BeautifulSoup, and Scrapy), but none seem to work.

What I’ve Tried:

Waiting longer for JavaScript to render the table

Increased the timeouts (wait_for_selector in Playwright, WebDriverWait in Selenium) and added sleep() before scraping.

Ensuring the correct selector

The table exists inside:

<div class="col-sm-12">
  <table id="data-table" class="table ...">...</table>
</div>

I verified div.col-sm-12 table#data-table in Chrome DevTools, and it matches the actual table.

Trying different scraping tools

  • Playwright: wait_for_selector() times out even after 20+ seconds.
  • Selenium: WebDriverWait still doesn't detect the table.
  • BeautifulSoup: requests.get() only returns the initial page source, without the table.
  • Scrapy: the table is missing from the response.
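One way to sanity-check the BeautifulSoup/Scrapy failures is to inspect the raw HTML directly: does it contain the table at all, and does it embed any iframes that might hold the content instead? Below is a minimal sketch using only the standard library; the HTML snippet is a made-up stand-in for what requests.get(BASE_URL).text would return (BeautifulSoup's find_all would do the same job):

```python
from html.parser import HTMLParser

class FrameAndTableFinder(HTMLParser):
    """Collect iframe src attributes and table ids from static HTML."""
    def __init__(self):
        super().__init__()
        self.iframes = []
        self.tables = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            self.iframes.append(attrs.get("src", ""))
        elif tag == "table":
            self.tables.append(attrs.get("id", ""))

# Made-up stand-in for the initial page source (requests.get(BASE_URL).text)
initial_html = """
<html><body>
  <div id="main-wrapper"><iframe src="https://example.com/widget"></iframe></div>
</body></html>
"""

finder = FrameAndTableFinder()
finder.feed(initial_html)
print("iframes:", finder.iframes)  # non-empty: the content may live inside a frame
print("tables:", finder.tables)    # empty: the table is not in the initial HTML
```

If the initial HTML lists iframes but no table, a plain HTTP client will never see the table, because it does not fetch or render frame documents.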

Triggering JavaScript events

  • Tried scrolling down (execute_script("window.scrollTo(0, document.body.scrollHeight)")).
  • Simulated clicks to see if the table needs interaction.
  • Manually checked for AJAX requests in the Network tab, but didn't find any obvious API calls.

Issue:

  • The table is not present in the initial HTML response.
  • JavaScript takes 5-7 seconds to load it, but scrapers don't seem to detect it.
  • Everything works fine if I manually copy-paste the table's HTML and parse it with pandas.

My Code (Using Playwright as an Example)

(But I’m open to Selenium, Scrapy, or other suggestions!)

import asyncio
from io import StringIO

import pandas as pd
from playwright.async_api import async_playwright

BASE_URL = "https://www.chimwiini.com/p/chimwiini-dictionary.html"

async def scrape_page(page):
    """Scrape the page and return table data."""
    try:
        await page.goto(BASE_URL, wait_until="load")
        await asyncio.sleep(10)  # Give extra time for JavaScript to load

        # Try waiting for the table
        await page.wait_for_selector("div.col-sm-12 table#data-table", timeout=20000)

        # Extract the table's inner HTML and re-wrap it for pandas
        table_html = await page.inner_html("div.col-sm-12 table#data-table")
        tables = pd.read_html(StringIO(f"<table>{table_html}</table>"))  # Parse with pandas

        return tables[0] if tables else None

    except Exception as e:
        print(f"Error: {e}")
        return None

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        data = await scrape_page(page)
        await browser.close()

        if data is not None:
            print(data.head())  # Print sample data
        else:
            print("No data scraped.")

asyncio.run(main())

Here is the error:

runfile('C:/Users/Gaming/anaconda3/Lib/site-packages/spyder_kernels/untitled0.py', wdir='C:/Users/Gaming/anaconda3/Lib/site-packages/spyder_kernels')
Scraping page 1...
Error scraping page 1: Page.wait_for_selector: Timeout 20000ms exceeded.
Call log:
  - waiting for locator("div.col-sm-12 table#data-table") to be visible

My Questions:

  1. How can I ensure the table is fully loaded before scraping?
  2. Do I need to trigger specific JavaScript events to make it appear?
  3. Is there a way to detect the exact API call that loads the table?
  4. If JavaScript loads data asynchronously, can I intercept or extract it differently?
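For reference, the pandas step that already works when I manually copy-paste the table's HTML looks like this (the table below is a shortened, made-up stand-in for the real one):

```python
from io import StringIO

import pandas as pd

# Stand-in for a manually copied table snippet
table_html = """
<table id="data-table">
  <thead><tr><th>Chimwiini Word</th><th>English Words</th></tr></thead>
  <tbody>
    <tr><td>Aakhiri</td><td>Last</td></tr>
    <tr><td>Aduwi</td><td>Enemy</td></tr>
  </tbody>
</table>
"""

# Recent pandas versions expect a file-like object rather than a raw string
df = pd.read_html(StringIO(table_html))[0]
print(df)
```

So the parsing side is fine; the problem is purely getting the rendered table HTML out of the live page.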

I’m open to Playwright, Selenium, Scrapy, or any other approach that works.

Any help is greatly appreciated.


Solution

  • I was able to get it done with just Python/Selenium. A few things:

    1. The desired TABLE is buried several IFRAMEs deep.
    2. You need to use WebDriverWait to wait for the desired TABLE to be visible, to ensure that the table data has finished loading.

    I threw the data into a pandas DataFrame to make it print pretty. You can use it or remove it as you wish.

    Here's the working code:

    import pandas as pd
    
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()
    
    url = "https://www.chimwiini.com/p/chimwiini-dictionary.html"
    driver.get(url)
    
    # wait and switch into each of the IFRAMEs so we can access the TABLE
    wait = WebDriverWait(driver, 10)
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#main-wrapper iframe")))
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "sandboxFrame")))
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "userHtmlFrame")))
    
    # grab the table headers
    headings = []
    for th in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#data-table thead th"))):
        headings.append(th.text)
    
    # grab each row
    rows = []
    for table_row in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#data-table tbody tr"))):
        cells = []
        for cell in table_row.find_elements(By.CSS_SELECTOR, "td"):
            cells.append(cell.text)
        rows.append(cells)
    
    # create and print the DataFrame
    df = pd.DataFrame(rows, columns=headings)
    print(df)
    
    driver.quit()
    

    It prints

       Chimwiini Word           English Words Chimwiini Synonyms          English Synonyms
    0         Aakhiri                    Last
    1          Aarani
    2          Abaari
    3           Abadi  Constantly, Frequently                     All the time, Everytime,
    4  Abbay Faatduma
    5         Achaari        Spicy condiment,                       Pickle masala, chutney
    6           Adabu         Polite, Manners
    7           Adadi                  Amount
    8           Aduwi                   Enemy
    9          Afisha                 Forgive
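    If you want to persist the result rather than just print it, the DataFrame step extends naturally. A small sketch with placeholder data standing in for the scraped headings/rows (the output filename is arbitrary):

```python
import pandas as pd

# Placeholders for the headings and rows collected by the Selenium loops above
headings = ["Chimwiini Word", "English Words"]
rows = [["Aakhiri", "Last"], ["Aduwi", "Enemy"]]

df = pd.DataFrame(rows, columns=headings)
df.to_csv("chimwiini_dictionary.csv", index=False)  # arbitrary output name
```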