I am trying to automate the scraping of a site with "infinite scroll" using Python and Playwright.
The issue is that Playwright does not yet include scroll functionality, let alone infinite auto-scroll functionality.
From what I found online and from my own testing, I can automate an infinite or finite scroll using the page.evaluate() function and some JavaScript code.
For example, this works:
for i in range(20):
    page.evaluate('var div = document.getElementsByClassName("comment-container")[0];div.scrollTop = div.scrollHeight')
    page.wait_for_timeout(500)
The problem with this approach is that it only works either with a fixed number of scrolls or by running forever in a while True loop.
I need to find a way to tell it to keep scrolling until the final content loads.
This is the JavaScript that I am currently trying in page.evaluate():
var intervalID = setInterval(function() {
    var scrollingElement = (document.scrollingElement || document.body);
    scrollingElement.scrollTop = scrollingElement.scrollHeight;
    console.log('fail');
}, 1000);
var anotherID = setInterval(function() {
    if ((window.innerHeight + window.scrollY) >= document.body.offsetHeight) {
        clearInterval(intervalID);
    }
}, 1000);
This works neither in my Firefox browser nor in the Playwright Firefox browser. It returns immediately and doesn't execute the code at intervals.
I would be grateful if someone could tell me how I can, using Playwright, create an auto-scroll function that will detect and stop when it reaches the bottom of a dynamically loading webpage.
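One likely reason the setInterval snippet above appears to return immediately is that page.evaluate() comes back as soon as the synchronous part of the script finishes, while the interval callbacks only fire later. Playwright does wait when the evaluated function returns a Promise, so a possible workaround is to wrap the interval in one and resolve it once the height stops growing. The following is only a sketch under assumptions: the names SCROLL_UNTIL_STABLE_JS and scroll_in_page are made up for illustration, and the "three idle ticks" rule and the 1000 ms interval are arbitrary choices that may need tuning per site.

```python
# JavaScript that scrolls inside the page and resolves once the scroll
# height has stayed the same for three consecutive ticks (an assumed,
# tunable threshold).
SCROLL_UNTIL_STABLE_JS = """
() => new Promise((resolve) => {
    const el = document.scrollingElement || document.body;
    let lastHeight = el.scrollHeight;
    let idleTicks = 0;
    const timer = setInterval(() => {
        el.scrollTop = el.scrollHeight;
        if (el.scrollHeight === lastHeight) {
            idleTicks += 1;
        } else {
            idleTicks = 0;
            lastHeight = el.scrollHeight;
        }
        if (idleTicks >= 3) {
            clearInterval(timer);
            resolve(lastHeight);
        }
    }, 1000);
})
"""


def scroll_in_page(page):
    """Run the in-page scroller; page.evaluate() waits for the Promise.

    `page` is expected to be a Playwright sync-API Page. Returns the
    final scroll height reported by the script.
    """
    return page.evaluate(SCROLL_UNTIL_STABLE_JS)
```

With a real page this would be called as scroll_in_page(page) after page.goto(...). On a feed that never ends, the Promise never resolves, so a cap on the number of ticks would be a sensible addition.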
Newer versions of Playwright do have a scroll function: page.mouse.wheel(delta_x, delta_y). The code below scrolls through youtube.com, which has an "infinite scroll":
from playwright.sync_api import Playwright, sync_playwright


def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()

    # Open a new page
    page = context.new_page()
    page.goto('https://www.youtube.com/')

    # page.mouse.wheel(delta_x, delta_y): delta_x scrolls horizontally,
    # delta_y vertically (positive scrolls down, negative scrolls up)
    for i in range(5):  # make the range as long as needed
        page.mouse.wheel(0, 15000)
        # wait_for_timeout is preferred over time.sleep in the sync API
        page.wait_for_timeout(2000)

    # Keep the browser open for a while to inspect the result
    page.wait_for_timeout(15000)

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)
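The loop above still scrolls a fixed number of times. To stop once the bottom is reached, one option is to compare the document height before and after each scroll and give up when it has stopped growing. This is a minimal sketch, not a definitive implementation: scroll_to_bottom and its parameters are hypothetical names, and the assumption that "no height change for a few rounds" means "no more content" can fail on slow connections or on sites that recycle DOM nodes.

```python
def scroll_to_bottom(page, step=15000, pause_ms=1000,
                     max_stable_rounds=3, max_rounds=200):
    """Scroll until the document height stops growing.

    `page` is expected to be a Playwright sync-API Page. Returns the
    number of scrolls performed. `max_rounds` is a safety cap for
    feeds that never end.
    """
    height_js = "document.documentElement.scrollHeight"
    last_height = page.evaluate(height_js)
    stable_rounds = 0
    scrolls = 0

    while stable_rounds < max_stable_rounds and scrolls < max_rounds:
        page.mouse.wheel(0, step)
        page.wait_for_timeout(pause_ms)  # give new content time to load
        scrolls += 1

        new_height = page.evaluate(height_js)
        if new_height > last_height:
            # New content arrived, so reset the idle counter.
            last_height = new_height
            stable_rounds = 0
        else:
            stable_rounds += 1

    return scrolls
```

In the run() function above, the for loop could be replaced by a call such as scroll_to_bottom(page). Raising pause_ms or max_stable_rounds makes the check more tolerant of slow loading, at the cost of a longer wait at the real bottom.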