Tags: python, playwright, playwright-python

Google search using Playwright


I am trying to perform a Google search using Playwright, but I am getting this error:

playwright._impl._errors.TimeoutError: Page.fill: Timeout 60000ms exceeded.
Call log:
waiting for locator("input[name=\"q\"]")

Here is the code:

from playwright.async_api import async_playwright
import asyncio


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(args=["--disable-gpu", "--single-process", "--headless=new"], headless=True)

        page = await browser.new_page()

        # Go to Google
        await page.goto('https://www.google.com')

        # Accept the cookies prompt (if it appears)
        try:
            accept_button = await page.wait_for_selector('button:has-text("I agree")', timeout=5000)
            if accept_button:
                await accept_button.click()
        except:
            pass

        # Search for a query
        query = "Playwright Python"
        await page.fill('input[name="q"]', query, timeout=60000)
        await page.press('input[name="q"]', 'Enter')

        # Wait for the results to load
        await page.wait_for_selector('h3')

        # Extract the first result's link
        first_result = await page.query_selector('h3')
        first_link = await first_result.evaluate('(element) => element.closest("a").href')
        print("First search result link:", first_link)

        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())


Solution

  • As mentioned in a comment, changing input to textarea in the selectors fixes the problem, since Google's search box is now a textarea element rather than an input.

    That said, a common XY problem/antipattern in scraping is assuming you must act like a human. That only applies to testing. In scraping, you can take shortcuts that get the result faster, more reliably, and with less code. Skip over extraneous pages, don't click buttons unless you have to, rip out cookie banners rather than dutifully accepting them, skip visibility checks and trusted events. You can even intercept requests or hit the API directly.

    Putting that into practice here, you can navigate directly to the results page in a single navigation and be done with it, avoiding the fuss, slowness, and unreliability of automating the landing page:

    import asyncio
    import urllib.parse
    from playwright.async_api import async_playwright  # 1.44.0
    
    
    async def main():
        term = urllib.parse.quote_plus("playwright python")
        url = f"https://www.google.com/search?q={term}"
    
        async with async_playwright() as pw:
            browser = await pw.chromium.launch()
            page = await browser.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            link = await page.locator("h3").first.evaluate('el => el.closest("a").href')
            print("First search result link:", link)
            await browser.close()
    
    
    if __name__ == "__main__":
        asyncio.run(main())
    

    Avoid wait_for_selector and query_selector; prefer locators, which auto-wait and re-query the DOM when acted upon.

    Beyond that, the data is in the static HTML, so you don't even need Playwright, assuming you're comfortable doing a bit of URL processing (Google obfuscates results for some reason):

    import re
    import urllib.parse
    from bs4 import BeautifulSoup  # 4.10.0
    from requests import get  # 2.31.0
    
    
    term = urllib.parse.quote_plus("playwright python")
    response = get(f"https://www.google.com/search?q={term}")
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")  # 4.8.0
    link = soup.select_one("h3").find_parent("a")["href"].replace(r"/url?q=", "")
    link = re.sub(r"&sa=U&.*", "", link)
    print("First search result link:", link)
    

    Speed test:

    # OP's playwright script after the input -> textarea fix
    real 0m7.487s
    user 0m2.804s
    sys  0m0.438s
    
    # my playwright script
    real 0m1.046s
    user 0m0.539s
    sys  0m0.136s
    
    # requests/beautiful soup
    real 0m0.521s
    user 0m0.157s
    sys  0m0.023s
    

    In some cases, it can be beneficial to block requests for resources you don't need, like images, stylesheets, and so forth, and to disable JS. Here's an example of doing this for posterity, but it doesn't help in this case, probably because the Google results page is already fast and light on JS:

    import asyncio
    import urllib.parse
    from playwright.async_api import async_playwright
    
    
    async def main():
        term = urllib.parse.quote_plus("playwright python")
        url = f"https://www.google.com/search?q={term}"
    
        async def handle_requests(route, request):
            if request.url.startswith(url):
                await route.continue_()
            else:
                await route.abort()
    
        async with async_playwright() as pw:
            browser = await pw.chromium.launch()
            context = await browser.new_context(java_script_enabled=False)
            page = await context.new_page()
            await page.route("**/*", handle_requests)
            await page.goto(url, wait_until="domcontentloaded")
            link = await page.locator("h3").first.evaluate('el => el.closest("a").href')
            print("First search result link:", link)
            await browser.close()
    
    
    if __name__ == "__main__":
        asyncio.run(main())