python-3.x pandas web-scraping popup pyppeteer

How to handle a privacy information popup? I am using Python and the pyppeteer library for web scraping


I have a working scraper, but I have trouble closing the popup. The popup only appears in certain cases, so I need to handle it when it shows up.

I have tried finding the button by its attributes and clicking "Accept All".

The part of the code between the ** markers is what I have tried.

import asyncio
from pyppeteer import launch
import time
from datetime import datetime, timedelta
import pandas as pd


async def filter_by_url(url):

    browser = await launch(
        {
            "headless": False,
            'args':['--start-maximized'],
            # 'executablePath':'/usr/bin/google-chrome'
        }
    )
    # url = "https://www.justwatch.com/us/provider/netflix?sort_by=trending_7_day"
    page = await browser.newPage()
    await page.setViewport({'width': 1920, 'height': 1080})
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')
    await page.goto(url)   
    ## Scroll To Bottom
    **#time.sleep(5)

    #await page.waitFor('footer span[data-icon="Accept all"]')
    #await page.click('button:has-text("Accept all")');
    #await Page.locator('uc-accept-all-button').first().click();**
    while True:
        
        await page.evaluate("""{window.scrollBy(0, document.body.scrollHeight);}""")
        await asyncio.sleep(2)  # pause without blocking the event loop
        end_point = await page.querySelector(".timeline__end-of-timeline")
        if end_point:
            print("reached to end points")
            break      

    

# Run the function
urls = [
    'https://www.justwatch.com/ca/provider/netflix?sort_by=trending_7_day'
]
for url in urls:
    asyncio.get_event_loop().run_until_complete(filter_by_url(url))


Solution

  • Your button is inside a shadow root. To reach the shadow root's internal structure, get its host element first and then read the host's shadowRoot property.

    The shadow host has the selector #usercentrics-root. Wait for the host's content to load and then click the button inside it. If the content has not been rendered yet, retry after a timeout.

    After that, it is good practice to wait for the host to be hidden.

    More about Shadow DOM

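      # Poll inside the page until the consent button exists, then click it.
      await page.evaluate("""function acceptConsent() {
          // The button lives in the shadow root of #usercentrics-root.
          const host = document.querySelector('#usercentrics-root');
          const accept = host && host.shadowRoot
              && host.shadowRoot.querySelector('[data-testid=uc-accept-all-button]');
          if (accept) {
              accept.click();
              return;
          }
          // Host or button not rendered yet: retry in 500 ms.
          setTimeout(acceptConsent, 500);
      }""")
      # Note: 'visible': False only waits for the host to be present in the DOM;
      # pass {'hidden': True} instead to wait until it is actually hidden.
      await page.waitForSelector('#usercentrics-root', options={'visible': False})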
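
Because the popup only appears in some cases, the polling idea above can also be driven from Python with an upper time limit, so the scraper carries on when no banner shows up. The following is a minimal sketch, not the answerer's code: the helper name accept_consent_if_present, its timeout and poll_interval parameters, and the CLICK_ACCEPT_JS constant are hypothetical names introduced for illustration. It assumes the same #usercentrics-root host and uc-accept-all-button test id as above, and it only needs a page object with an awaitable evaluate method, such as a pyppeteer page.

```python
import asyncio
import time

# JavaScript run in the page: returns true once the button was found and clicked.
# Assumes the Usercentrics selectors quoted in the answer above.
CLICK_ACCEPT_JS = """() => {
    const host = document.querySelector('#usercentrics-root');
    const button = host && host.shadowRoot
        && host.shadowRoot.querySelector('[data-testid=uc-accept-all-button]');
    if (!button) {
        return false;
    }
    button.click();
    return true;
}"""


async def accept_consent_if_present(page, timeout=10.0, poll_interval=0.5):
    """Hypothetical helper: click "Accept all" if the popup appears in time.

    Returns True if the button was clicked, False if it never showed up
    before the timeout (in seconds) elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Each call re-checks the shadow root, so a late-rendered popup is caught.
        if await page.evaluate(CLICK_ACCEPT_JS):
            return True
        if time.monotonic() >= deadline:
            return False
        # Non-blocking pause between attempts.
        await asyncio.sleep(poll_interval)
```

Inside filter_by_url, this could be called as `clicked = await accept_consent_if_present(page)` right after `await page.goto(url)` and before the scroll loop. A False result simply means no popup was seen within the time limit, so scraping can continue.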