
Playwright - scraping eBay deals


from playwright.sync_api import Playwright, sync_playwright

with sync_playwright() as playwright:
    chromium = playwright.chromium
    browser = chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.ebay.com/deals/tech/ipads-tablets-ereaders")
    button = page.locator("button.load-more-btn.btn.btn--secondary")
    try:
        while button:
            button.scroll_into_view_if_needed()
            button.click()
    except:
        pass
        items = page.locator("div.dne-itemtile.dne-itemtile-large").all()
        for item in items:
            print(item.locator("img").get_attribute("src"))
            print(item.locator("span.first").text_content())
            print(item.locator("span.ebayui-ellipsis-2").text_content())
            print()
        print(len(items), "items")

I am trying to scrape eBay deals.
In my try block, with headless=False, I can see the browser click the button to show more items until there is no button left, but the code does not scrape all the items, only the first 4 pages at most.

The eBay deals page can list more than 800 items, but I am only able to scrape the first 96.


Solution

  • In short, when you click the button (or scroll down), the page sends a request to the server to retrieve more deals (you can see it in the browser's developer tools). You can fetch the deals with requests alone, without Playwright or Selenium.

    Example:

    import time
    import json
    import requests
    from bs4 import BeautifulSoup
    
    LISTINGS_URL = "https://www.ebay.com/deals/spoke/ajax/listings"
    TIMEZONE_OFFSET = 63072000
    
    def get_dp1():
        current_time = hex(int(time.time()) + TIMEZONE_OFFSET)[2:]
        return f"bbl/DE{current_time}^"
    
    def parse_deals(content):
        soup = BeautifulSoup(content, "lxml")
        items = []
        for el in soup.select("div[data-listing-id]"):
            image = el.select_one("img").get("src")
            price = el.select_one("span.first").text
            title = el.select_one("span.ebayui-ellipsis-2").text
            items.append({"title": title, "price": price, "image": image})
        return items
    
    items = []
    
    with requests.Session() as session:
        session.cookies.set("dp1", get_dp1())
        params = {"_ofs": 0, "category_path_seo": "tech,ipads-tablets-ereaders"}
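        while True:
            print(f"Total: {len(items):<5} | Offset: {params['_ofs']}")
            response = session.get(LISTINGS_URL, params=params)
            response.raise_for_status()
            data = response.json().get("fulfillmentValue", {})
            # Parse this page before checking for a next page, so the
            # final page is not dropped.
            items.extend(parse_deals(data.get("listingsHtml", "")))
            # The response carries the query parameters for the next page.
            params = data.get("pagination", {}).get("params")
            if not params:
                break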
    
    with open("data.json", "w") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    

    Output:

    [
      {
        "title": "Samsung Galaxy Tab A9+ 11.0\" 64GB Gray Wi-Fi Tablet Bundle SM-X210NZAYXAR 2023",
        "price": "$139.99",
        "image": "https://i.ebayimg.com/images/g/qbUAAOSw1o1l1Rtt/s-l300.jpg"
      },
      ...
    ]
    

    To obtain deals, as mentioned earlier, the page sends a GET request that must carry the cookie dp1, which encodes a Unix timestamp (for example, bbl/DE6a9839a1^). Here, bbl/DE and ^ are constant values (as I understand it), and between them is the timestamp in hexadecimal: the current Unix time plus the offset that the code above calls TIMEZONE_OFFSET.

    You may need to adjust that offset, because the dp1 value the site sets when you visit it is computed relative to its own timezone.

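    After that, the server responds with a JSON object whose fulfillmentValue field contains everything needed for scraping: the listings markup (listingsHtml) and the query parameters for the next page (pagination.params).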
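
If you need to work out the right offset, it helps to take a dp1 value apart. Below is a minimal sketch of the cookie format described above, assuming the bbl/DE prefix and ^ suffix really are constant. The helper names build_dp1 and decode_dp1 are my own and not part of any eBay API.

```python
import time

DP1_PREFIX = "bbl/DE"      # assumed constant, taken from the example value
DP1_SUFFIX = "^"           # assumed constant, taken from the example value
OFFSET_SECONDS = 63072000  # same value as TIMEZONE_OFFSET above (730 days)


def build_dp1(now=None, offset=OFFSET_SECONDS):
    """Return a dp1 value: prefix + hex(Unix time + offset) + suffix."""
    if now is None:
        now = int(time.time())
    return f"{DP1_PREFIX}{now + offset:x}{DP1_SUFFIX}"


def decode_dp1(value):
    """Return the Unix timestamp embedded in a dp1 value."""
    if not (value.startswith(DP1_PREFIX) and value.endswith(DP1_SUFFIX)):
        raise ValueError(f"unexpected dp1 format: {value!r}")
    return int(value[len(DP1_PREFIX):-len(DP1_SUFFIX)], 16)


example = "bbl/DE6a9839a1^"
embedded = decode_dp1(example)

# Timestamp carried by the cookie.
print(embedded)
# Timestamp with the offset removed again.
print(embedded - OFFSET_SECONDS)
# Rebuilding the cookie from that time gives back the same string.
print(build_dp1(now=embedded - OFFSET_SECONDS) == example)
```

To pick an offset for your own session, copy the dp1 value your browser receives from the site, pass it to decode_dp1, and subtract int(time.time()). The difference is the offset to use in place of TIMEZONE_OFFSET.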