python selenium-webdriver seleniumwire

Can't find video tag


I am trying to fetch the URL from a tag in the HTML of the site https://95jun.kinoxor.pro/984-univer-13-let-spustja-2024-07-06-19-54.html

The site is tricky. You can open it from this search results page (it is the first or second result): https://yandex.ru/search/?text=https%3A%2F%2Fkinokubok.pro%2F232-univer-13-let-spustja-2024-06-25-19-51.html&lr=21653

I am looking for this URL: <iframe src="https://api.stiven-king.com/storage.html" ...

Proof that the URL exists: [screenshot]

How can I fetch this HTML tag's content?

My code:

import seleniumwire.undetected_chromedriver as uc
import time

options = uc.ChromeOptions()
options.add_argument('--ignore-ssl-errors=yes')
options.add_argument('--ignore-certificate-errors')

driver = uc.Chrome(options=options)

def interceptor(request):
    del request.headers['Referer'] 
    request.headers['Referer'] = 'https://yandex.ru/'

url = "https://125jun.kinoamor.pro/251-univer-13-let-spustja-2024-06-27-19-51.html"

driver.request_interceptor = interceptor
driver.get(url)

time.sleep(3)
iframe_tag_elements = driver.find_elements("xpath", "//iframe")
print(f"FOUND VIDEO TAGS: {len(iframe_tag_elements)}") # prints 7
for iframe_elem in iframe_tag_elements:
    video_url = iframe_elem.get_attribute("src")
    if video_url:
        print("XXX_ ", video_url)

**PROBLEM**: the URL "https://api.stiven-king.com/storage.html" is not printed. I also don't see the URL in driver.page_source.

I tried sleeping and scrolling the page, but it didn't help.

I also tried driver.switch_to.frame(iframe_elem) and then searched for iframes again.


Solution

  • As suggested in the other answers, you need to switch to the <iframe> containing the link you are looking for. Instead of switching to the first <iframe>, you can provide a more specific locator:

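    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as ec
    from selenium.webdriver.support.wait import WebDriverWait
    
    # replaced the url
    url = "https://01jul.kinokubok.pro/232-univer-13-let-spustja-2024-07-03-20-19.html"
    
    driver.request_interceptor = interceptor
    driver.get(url)
    
    # Wait for the src-less <iframe> inside the video box, then switch into it
    WebDriverWait(driver, 30).until(ec.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "#dle-content .video-box > iframe:not([src])")))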
    iframe_tag_elements = driver.find_elements("xpath", "//iframe")
    print(f"FOUND VIDEO TAGS: {len(iframe_tag_elements)}")
    for iframe_elem in iframe_tag_elements:
        video_url = iframe_elem.get_attribute("src")
        if video_url:
            print("XXX_ ", video_url)
    

    Output

    FOUND VIDEO TAGS: 1
    XXX_  https://api.stiven-king.com/storage.html
    

    If you want the src values of all <iframe>s, you can build a recursive function to extract them.

    To make the page load faster, you can set page_load_strategy to 'eager', but be aware that you might have to add an explicit wait if the script runs ahead of the page.

    Complete code

    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as ec
    import seleniumwire.undetected_chromedriver as uc
    
    
    options = uc.ChromeOptions()
    options.add_argument('--ignore-ssl-errors=yes')
    options.add_argument('--ignore-certificate-errors')
    options.page_load_strategy = 'eager'
    
    driver = uc.Chrome(options=options)
    
    
    def interceptor(request):
        del request.headers['Referer']
        request.headers['Referer'] = 'https://yandex.ru/'
    
    
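    def get_frame_data(frames):
        """Collect the src of each given frame and, recursively, of its nested frames."""
        src = []
        for frame in frames:
            video_url = frame.get_attribute("src")
            if video_url:
                src.append(video_url)
    
            # Descend into the frame and look for nested iframes
            driver.switch_to.frame(frame)
            child_frames = driver.find_elements("xpath", "//iframe")
            if child_frames:
                src.extend(get_frame_data(child_frames))
            # Go up one level only, so sibling frames at this level stay reachable
            driver.switch_to.parent_frame()
    
        return src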
    
    
    url = "https://01jul.kinokubok.pro/232-univer-13-let-spustja-2024-07-03-20-19.html"
    
    driver.request_interceptor = interceptor
    driver.get(url)
    
    wait = WebDriverWait(driver, 10)
    wait.until(ec.visibility_of_element_located(("id", "grid")))
    wait.until(ec.visibility_of_element_located(("class name", "karusel")))
    
    iframe_tag_elements = driver.find_elements("xpath", "//iframe")
    all_src = get_frame_data(iframe_tag_elements)
    for sr in all_src:
        print("XXX_ ", sr)
    

    Output 1:

    XXX_  https://api.marts.ws/embed/movie/74360
    XXX_  https://loosening-as.allarknow.online/?token_movie=be2b9578d8cae35323bb199f888be1&token=b5c08f668c592ee23d32031d27de44
    XXX_  https://www.youtube.com/embed/mthO33phh9U
    XXX_  https://yastatic.net/share2/v-1.16.0/frame.html?namespace=ya-share2.0.7255632935282506
    XXX_  https://yastatic.net/share2/v-1.16.0/frame.html?namespace=ya-share2.0.027842698862642123
    

    Output 2:

    XXX_  https://api.stiven-king.com/storage.html
    XXX_  https://loosening-as.allarknow.online/?token_movie=be2b9578d8cae35323bb199f888be1&token=b5c08f668c592ee23d32031d27de44
    XXX_  https://www.youtube.com/embed/mthO33phh9U
    XXX_  https://yastatic.net/share2/v-1.16.0/frame.html?namespace=ya-share2.0.9638742292394189
    XXX_  https://yastatic.net/share2/v-1.16.0/frame.html?namespace=ya-share2.0.8570699995377056
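
The recursive traversal in get_frame_data can be checked without a browser. The following is only a rough sketch: FakeFrame, FakeDriver and collect_frame_srcs are hypothetical names, not part of Selenium, and the example.com URL is made up for illustration. The fake driver mimics just the three switch_to calls involved and keeps the current frame context on a stack. The walk returns from each frame with parent_frame(), so a second frame nested at the same level is still reachable.

```python
class FakeFrame:
    """Hypothetical stand-in for an <iframe> WebElement."""

    def __init__(self, src=None, children=()):
        self.src = src
        self.children = list(children)

    def get_attribute(self, name):
        return self.src if name == "src" else None


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; the frame context is a stack."""

    def __init__(self, top_level_frames):
        self._root = FakeFrame(children=top_level_frames)
        self._stack = [self._root]
        self.switch_to = self  # lets driver.switch_to.frame(...) read like Selenium

    def find_elements(self, by, value):
        # Only frames of the *current* context are visible, as in a real browser
        return list(self._stack[-1].children)

    def frame(self, frame):
        if frame not in self._stack[-1].children:
            raise LookupError("frame is not in the current browsing context")
        self._stack.append(frame)

    def parent_frame(self):
        if len(self._stack) > 1:
            self._stack.pop()

    def default_content(self):
        del self._stack[1:]


def collect_frame_srcs(driver, frames):
    """Depth-first walk: record each frame's src, then descend into it."""
    srcs = []
    for frame in frames:
        src = frame.get_attribute("src")
        if src:
            srcs.append(src)
        driver.switch_to.frame(frame)
        srcs.extend(collect_frame_srcs(driver, driver.find_elements("xpath", "//iframe")))
        driver.switch_to.parent_frame()  # back up exactly one level
    return srcs


# A src-less wrapper holding two nested frames, plus one ordinary top-level frame
page = [
    FakeFrame(children=[
        FakeFrame("https://api.stiven-king.com/storage.html"),
        FakeFrame("https://example.com/second-nested.html"),  # made-up URL
    ]),
    FakeFrame("https://www.youtube.com/embed/mthO33phh9U"),
]
driver = FakeDriver(page)
all_srcs = collect_frame_srcs(driver, driver.find_elements("xpath", "//iframe"))
for s in all_srcs:
    print("XXX_ ", s)
```

In this model, calling default_content() instead of parent_frame() inside the loop makes the second nested frame raise LookupError, because the context has jumped back to the top-level document. A real driver would typically fail in a similar way, with a stale-element or no-such-frame error.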