I am trying to get all search results (URLs) from https://docs.vgd.ru/search/?v=1. I am using the XPath //a[@class='resultsMain'] to find them. The XPath is valid.
My code:
import asyncio
import time
from playwright.async_api import async_playwright


class VgdParser:
    def __init__(self, headless: bool = False):
        self.headless = headless
        self.browser = None
        self.playwright = None
        self.page = None

    async def start_browser(self):
        """Start the browser and create a new page with stealth mode"""
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(
            headless=self.headless,
        )
        self.page = await self.browser.new_page()
        await self.page.goto("https://docs.vgd.ru/search/?v=1")

    async def search_by_name(self, name: str):
        """Enter name in https://docs.vgd.ru/search/?v=1 and collect results"""
        # Wait for the iframe to appear
        # Get the iframe (case-insensitive id)
        frame = self.page.frame(name="iFrame1")
        if frame is None:
            raise Exception("iframe with id 'iFrame1' not found")
        # Wait for the input inside the iframe
        input_locator = frame.locator('//input[@placeholder="Введите запрос"]')
        await input_locator.fill(name)
        await input_locator.press('Enter')
        frame = self.page.frame(name="iFrame1")
        if frame is None:
            raise Exception("iframe with id 'iFrame1' not found")
        results_raw = frame.locator("//a[@class='resultsMain']")
        count = await results_raw.count()
        print("XXX_ ", count)  # PRINTS 0
        for i in range(count):
            cur_result = results_raw.nth(i)
            text = await cur_result.inner_text()
            print("Result:", text)


async def main():
    parser = VgdParser()
    await parser.start_browser()
    await parser.search_by_name("Алексей Ермаков")


if __name__ == "__main__":
    asyncio.run(main())
The problem is that in search_by_name, the line print("XXX_ ", count) prints 0, meaning no elements were found.
A few thoughts:
Avoid non-waiting operations like .count(). Those results take a second to show up, so you need to wait for them.
If the page has an iframe and you're ignoring the outer page, just navigate right to the iframe.
Avoid abstractions until your script is functionally working, then clean up afterwards. Often, premature abstractions get in the way.
Avoid XPath. CSS selectors read so much cleaner. XPath is only necessary in extremely rare circumstances where you need to move from a child to a parent in a way that CSS selectors can't support (but even in those cases, there's usually a better way).
If you are writing tests, avoid CSS selectors too in favor of user-visible, accessibility selectors, but I'm pretty sure you're scraping, so CSS selectors are fine.
Understand that locator declarations are not actions, and locators are never None; failures surface when an action runs, as a raised/thrown error. (page.frame() is the one call here that can genuinely return None, but only if the frame isn't attached yet.) Either way, you're testing the same condition twice with nothing changing in between, so the second if frame is None: branch is effectively unreachable. Remove these unnecessary safety checks, which are misleading/false assurances.
Here's a minimal rewrite, on which you can build abstractions on your own, if you need to:
from playwright.sync_api import sync_playwright  # 1.53.0

with sync_playwright() as playwright:
    browser = playwright.chromium.launch()
    page = browser.new_page()
    page.goto("https://vgd.ru/search2", wait_until="commit")
    page.get_by_placeholder("Введите запрос").fill("Алексей Ермаков")
    page.keyboard.press("Enter")
    results = page.locator(".resultsMain")
    results.first.wait_for()
    print(results.all_text_contents())
Output:
['Colesnik', 'Pacific', 'masterbos', 'yokainfromabyss', 'OlgaDoronina1983', ... ]
Checking to see if you can intercept a network request or hit an API directly is probably worth exploring here. I haven't investigated this, just noting that the above is not necessarily an optimal strategy, per se, just a first-step improvement over what you currently have.