python selenium-webdriver

I am trying to create a web scraper using Python and the Selenium library.

As soon as I try to scrape a website, the browser opens in a new instance but crashes immediately. The code and the error are shown below.

Code:

import selenium.webdriver as webdriver
from selenium.webdriver.chrome.service import Service
import time


def scrape_website(website):
    print("Launching the browser!")
    
    option=webdriver.ChromeOptions()
    driver=webdriver.Chrome()
    try:
        driver.get(website)
        print("The page is loaded now...")
        html=driver.page_source
        time.sleep(10)
        return html
    finally:
        driver.quit()

The error:

InvalidArgumentException: Message: invalid argument
  (Session info: chrome=131.0.6778.205)
Stacktrace:
    GetHandleVerifier [0x00007FF616866CC5+28821]
    (No symbol) [0x00007FF6167D3850]
    (No symbol) [0x00007FF6166755B9]
    (No symbol) [0x00007FF616663051]
    (No symbol) [0x00007FF6166612FD]
    (No symbol) [0x00007FF616661B3C]
    (No symbol) [0x00007FF61667885A]
    (No symbol) [0x00007FF6167101FE]
    (No symbol) [0x00007FF6166EF2FA]
    (No symbol) [0x00007FF61670F412]
    (No symbol) [0x00007FF6166EF0A3]
    (No symbol) [0x00007FF6166BA778]
    (No symbol) [0x00007FF6166BB8E1]
    GetHandleVerifier [0x00007FF616B9FCCD+3408029]
    GetHandleVerifier [0x00007FF616BB743F+3504143]
    GetHandleVerifier [0x00007FF616BAB61D+3455469]
    GetHandleVerifier [0x00007FF61692BDCB+835995]
    (No symbol) [0x00007FF6167DEB6F]
    (No symbol) [0x00007FF6167DA824]
    (No symbol) [0x00007FF6167DA9BD]
    (No symbol) [0x00007FF6167CA1A9]
    BaseThreadInitThunk [0x00007FF85F087374+20]
    RtlUserThreadStart [0x00007FF86057CC91+33]

I am using Streamlit for the frontend of the application. That code is below:

import streamlit as st # type: ignore
from scrape import scrape_website

st.title("College Website Scraper")
url=st.text_input("Enter the Website Address:")

if st.button("Scrape Site"):
    st.write("Scraping this Website")
    result=scrape_website(url)
    print(result)

Solution

  • The URL passed to driver.get() must include the scheme, e.g. https://

    The InvalidArgumentException you're seeing is raised because the URL has no scheme.

    You can use urlparse from urllib.parse to inspect the individual components of a URL.

    Ignoring Streamlit (because it's not relevant to the problem), here's an example of how you could check that an input URL contains a scheme:

    import selenium.webdriver as webdriver
    from selenium.webdriver import ChromeOptions
    from urllib.parse import urlparse
    
    def scrape_website(website):
        options = ChromeOptions()
        # Run Chrome without a visible window
        options.add_argument("--headless=new")
        # Pass options by keyword; the context manager quits the driver on exit
        with webdriver.Chrome(options=options) as driver:
            driver.get(website)
            return driver.page_source
    
    
    # Keep prompting until an empty line is entered
    while url := input("Enter url to scrape: "):
        p = urlparse(url)
        # driver.get() rejects URLs without a scheme, so check before calling it
        if not p.scheme:
            print("Scheme missing from url")
        else:
            html = scrape_website(url)
            print("HTML fragment:", html[:80])
    

    Example:

    Enter url to scrape: www.google.com
    Scheme missing from url
    Enter url to scrape: https://www.google.com
    HTML fragment: <html itemscope="" itemtype="http://schema.org/WebPage" lang="en-GB"><head><meta
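
If you would rather repair the input than reject it, the same urlparse check can be used to prepend a scheme. This is only a minimal sketch: normalize_url and its default_scheme parameter are hypothetical names, not part of Selenium or Streamlit, and defaulting to https is an assumption that will not suit every site.

```python
from urllib.parse import urlparse


def normalize_url(raw, default_scheme="https"):
    """Return raw with default_scheme prepended when no scheme is present.

    normalize_url and default_scheme are hypothetical names for illustration.
    """
    candidate = raw.strip()
    if not candidate:
        raise ValueError("empty URL")
    # urlparse leaves scheme empty for input such as "www.google.com"
    if not urlparse(candidate).scheme:
        candidate = f"{default_scheme}://{candidate}"
    # Once a scheme is present, a usable URL should also have a host part
    if not urlparse(candidate).netloc:
        raise ValueError(f"no host found in {raw!r}")
    return candidate


print(normalize_url("www.google.com"))
print(normalize_url("https://www.google.com"))
```

In the Streamlit app from the question, this could be applied as scrape_website(normalize_url(url)), catching ValueError to show a message to the user. Input of the form host:port without a scheme (for example localhost:8000) may be parsed as already having a scheme, so handle that case separately if you need it.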