
Selenium driver doesn't close when an element is not found (Python)


Could you help me find the error in my code? I have been stuck on it for a week, so I am asking the community. I am trying to download 14,000 HTML pages into a folder with Selenium; I have a long list of ids that I substitute into a page URL. Because the website I am downloading from is protected by a captcha, I am using a proxy: first I scrape free proxies from an online source and look for a working one, and when a proxy fails I tell my driver to close. The problem I am facing is the following:

  1. Using a working driver (with working proxy credentials), I request the page for every id in my list. This works fine.
  2. I inspect the page for a table. If it is there, I save the page; if driver.get returns a captcha instead, I want to close the driver. But it does not close. Selenium is perfectly fine downloading pages with no captcha, but when it gets a captcha it appears to do nothing, as if PyCharm were stuck. The code that scrapes proxies and finds a working driver is okay; I think the error is in the last lines of my code. Please see the code below:
import codecs
import os

from selenium.webdriver.common.by import By

# function to find an element: returns 1 if it is found and 0 if not

def find_element(driver, test_xpath = 'restab') -> int:
    if driver.find_elements(By.ID, test_xpath):
        var = 1
    else:
        var = 0
    return var

#function to download pages if the element is found and close driver if element is not located

def data_fill(id_list: list, driver) -> int:
    for id in id_list:
        author_page = "https://www.elibrary.ru/author_profile_new_titles.asp?id={}".format(id)
        driver.implicitly_wait(300)
        driver.get(author_page)
        result = find_element(driver)
        if result == 0:
            driver.close()
        else:
            n = os.path.join(f"/Users/dariagerashchenko/PycharmProjects/python_practice/hist/j_profile{id}.html")
            f = codecs.open(n, "w", "utf-8")
            h = driver.page_source
            f.write(h)
    return 1

# calling a function to get the code running
k = 0
while True:

    if k % 5 == 0:
        proxy_list = get_proxies()
    k += 1
    driver = get_best_driver(driver_path = driver_path, proxy_list = proxy_list) # find the working driver
    if driver is None:
        continue
    session_result = data_fill(id_list = id_list, driver=driver)
    if session_result == 1:  # data is collected
        print("Data collected.")

I tried several ways of telling the driver to close, but all of them failed. I previously worked in R and only recently switched to Python, so it may just be my lack of knowledge.
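
One detail in the code above may explain the apparent freeze: `driver.implicitly_wait(300)` also applies to `find_elements`, so when the table is absent (for example on a captcha page) the lookup can block for up to 300 seconds before it returns an empty list. Below is a minimal sketch of a bounded check, assuming the implicit wait is left at its default of 0. `table_present`, `timeout` and `poll` are hypothetical names, and the locator string `"id"` is the value behind Selenium's `By.ID`.

```python
import time


def table_present(driver, element_id="restab", timeout=10.0, poll=0.5):
    """Return True once an element with element_id shows up, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        # "id" is the string behind selenium's By.ID. With the implicit wait
        # at 0 this call returns right away, with an empty list on a miss.
        if driver.find_elements("id", element_id):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

With a check like this, a captcha page is reported after roughly `timeout` seconds instead of five minutes. Selenium's own `WebDriverWait` offers the same kind of explicit wait.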


Solution

  • Thanks to Nikhil Devadiga for his ideas, eventually I found an answer myself. Here it is:

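    k = 0
    while True:
        # Refresh the proxy list on every fifth attempt.
        if k % 5 == 0:
            proxy_list = get_proxies()
        k += 1
        driver = get_best_driver(driver_path = driver_path, proxy_list = proxy_list)
        if driver is None:  # no working proxy found, try again
            continue
        for id in id_list:
            session_result = data_fill(id_list = id, driver=driver)
            if session_result == 0:
                # Table missing (captcha page): close this driver and get a new one.
                driver.close()
                break
        print('done')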
    

    Before that, I also changed data_fill so that it handles a single id and returns 0 when the table is missing:

    def data_fill(id_list: str, driver) -> int:
        # id_list now holds a single id. Returns 1 if the page was saved, 0 if the table was missing.
        author_page = "https://www.elibrary.ru/author_profile_new_titles.asp?id={}".format(id_list)
        driver.get(author_page)
        result = find_element(driver)
        if result == 0:
            output = 0
        else:
            n = os.path.join(f"/Users/dariagerashchenko/PycharmProjects/python_practice/hist/j_profile{id_list}.html")
            # The with block closes the file once the page source is written.
            with codecs.open(n, "w", "utf-8") as f:
                f.write(driver.page_source)
            output = 1
        return output
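
One possible refinement, sketched under assumptions: the loop above starts again from the first id whenever a driver is replaced, so pages that were already saved get requested again. The sketch below keeps the unfinished ids in a queue and hands only those to the next driver. `download_until_blocked`, `fetch_page` and `save_page` are hypothetical names; `fetch_page(driver, page_id)` is assumed to return the page source, or `None` when the table is missing (captcha). It also calls `driver.quit()`, which ends the whole browser session, whereas `close()` only closes the current window.

```python
from collections import deque


def download_until_blocked(driver, pending, fetch_page, save_page):
    """Process ids from the pending deque until one of them is blocked.

    Returns True when the queue is empty, False when a captcha stopped the run.
    A blocked id stays at the front of the queue for the next driver.
    """
    try:
        while pending:
            page_id = pending[0]
            html = fetch_page(driver, page_id)
            if html is None:
                # Table missing: give up on this driver, keep the id queued.
                return False
            save_page(page_id, html)
            pending.popleft()
        return True
    finally:
        # Runs on every exit path, so the browser never stays open.
        driver.quit()
```

The outer loop would then build `pending = deque(id_list)` once and, while `pending` is non-empty, call this function with a fresh driver from `get_best_driver`.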