from playwright.sync_api import Playwright, sync_playwright

with sync_playwright() as playwright:
    chromium = playwright.chromium
    browser = chromium.launch()
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.ebay.com/deals/tech/ipads-tablets-ereaders")

    button = page.locator("button.load-more-btn.btn.btn--secondary")
    try:
        while button:
            button.scroll_into_view_if_needed()
            button.click()
    except:
        pass

    items = page.locator("div.dne-itemtile.dne-itemtile-large").all()
    for item in items:
        print(item.locator("img").get_attribute("src"))
        print(item.locator("span.first").text_content())
        print(item.locator("span.ebayui-ellipsis-2").text_content())
        print()

    print(len(items), "items")
I am trying to scrape eBay deals.
In my try block, with headless=False, I can watch the browser click the "load more" button until there is no more button, but the code does not scrape all the items, only about the first four pages.
An eBay deals page can have more than 800 items, yet I am only able to scrape the first 96.
In short, when you click (or scroll down), the page sends a request to the server (you can see it in the browser's developer tools) to fetch more deals. You can obtain the deals using requests alone, without worrying about Playwright or Selenium.
Example:
import time
import json

import requests
from bs4 import BeautifulSoup

LISTINGS_URL = "https://www.ebay.com/deals/spoke/ajax/listings"
TIMEZONE_OFFSET = 63072000


def get_dp1():
    current_time = hex(int(time.time()) + TIMEZONE_OFFSET)[2:]
    return f"bbl/DE{current_time}^"


def parse_deals(content):
    soup = BeautifulSoup(content, "lxml")
    items = []
    for el in soup.select("div[data-listing-id]"):
        image = el.select_one("img").get("src")
        price = el.select_one("span.first").text
        title = el.select_one("span.ebayui-ellipsis-2").text
        items.append({"title": title, "price": price, "image": image})
    return items


items = []
with requests.Session() as session:
    session.cookies.set("dp1", get_dp1())
    params = {"_ofs": 0, "category_path_seo": "tech,ipads-tablets-ereaders"}
    while True:
        print(f"Total: {len(items):<5} | Offset: {params['_ofs']}")
        response = session.get(LISTINGS_URL, params=params)
        data = response.json().get("fulfillmentValue", {})
        params = data.get("pagination", {}).get("params")
        if not params:
            break
        ditems = parse_deals(data["listingsHtml"])
        items.extend(ditems)

with open("data.json", "w") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)
Output:
[
  {
    "title": "Samsung Galaxy Tab A9+ 11.0\" 64GB Gray Wi-Fi Tablet Bundle SM-X210NZAYXAR 2023",
    "price": "$139.99",
    "image": "https://i.ebayimg.com/images/g/qbUAAOSw1o1l1Rtt/s-l300.jpg"
  },
  ...
]
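As a quick sanity check of the selectors that parse_deals depends on, the same parsing logic can be run against a minimal hand-written snippet. The HTML below is made up for illustration, and I use the built-in html.parser here so it runs without lxml installed:

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking one eBay deal tile, just to exercise the selectors
html = (
    '<div data-listing-id="1">'
    '<img src="https://example.com/x.jpg">'
    '<span class="first">$9.99</span>'
    '<span class="ebayui-ellipsis-2">Sample item</span>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
items = []
for el in soup.select("div[data-listing-id]"):
    items.append({
        "title": el.select_one("span.ebayui-ellipsis-2").text,
        "price": el.select_one("span.first").text,
        "image": el.select_one("img").get("src"),
    })

print(items)
```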
To obtain the deals, as mentioned earlier, the browser sends a GET request with a mandatory cookie dp1, whose value encodes the current Unix time (for example, bbl/DE6a9839a1^). Here, bbl/DE and ^ are constant values (as far as I can tell), and between them sits the current Unix time in hexadecimal.
You may need to adjust the Unix time offset, because when you visit the site, it sets the dp1 cookie relative to its own timezone.
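To verify the encoding, you can decode a dp1 value back into a Unix timestamp. The example value below is the bbl/DE6a9839a1^ sample from above:

```python
from datetime import datetime, timezone

dp1 = "bbl/DE6a9839a1^"

# Strip the constant "bbl/DE" prefix and "^" suffix, then parse the hex timestamp
ts = int(dp1[len("bbl/DE"):-1], 16)
print(ts)  # 1788361121
print(datetime.fromtimestamp(ts, timezone.utc))  # readable UTC date
```

Note that the decoded value lands a couple of years in the future, which matches the TIMEZONE_OFFSET of 63072000 seconds (two years) added in get_dp1.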
After that, the server responds with a JSON object that contains all the necessary information for scraping.
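Based on what the pagination loop above reads from that object, the relevant shape is roughly the following. The field contents here are invented placeholders; only the key names (fulfillmentValue, listingsHtml, pagination, params) come from the actual response:

```python
# Placeholder mimicking the response fields the scraper relies on
sample_response = {
    "fulfillmentValue": {
        "listingsHtml": '<div data-listing-id="1">...</div>',
        "pagination": {
            # Params for the next page; absent once the last page is reached
            "params": {"_ofs": 96, "category_path_seo": "tech,ipads-tablets-ereaders"},
        },
    }
}

data = sample_response.get("fulfillmentValue", {})
next_params = data.get("pagination", {}).get("params")
print(next_params)  # pass these back on the next GET, or stop when None
```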