Could you help me find the error in my code? I have been stuck on it for a week, so I am asking the community. I am trying to download 14,000 HTML pages into a folder with Selenium. I have a long list of ids that I insert into the page URL. Because the website I am downloading from is protected by a CAPTCHA, I go through a proxy: first I scrape free proxies from an online source and look for a working one, and when a proxy fails I tell my driver to close. The problem I am facing is in the following code:
import codecs
import os

from selenium.webdriver.common.by import By

# function to find an element. returns 1 if it finds it and 0 if not
def find_element(driver, test_xpath = 'restab') -> int:
    if driver.find_elements(By.ID, test_xpath):
        var = 1
    else:
        var = 0
    return var
# function to download pages if the element is found and close driver if element is not located
def data_fill(id_list: str, driver) -> int:
    for id in id_list:
        author_page = "https://www.elibrary.ru/author_profile_new_titles.asp?id={}".format(id)
        driver.implicitly_wait(300)
        driver.get(author_page)
        result = find_element(driver)
        if result == 0:
            driver.close()
        else:
            n = os.path.join(f"/Users/dariagerashchenko/PycharmProjects/python_practice/hist/j_profile{id}.html")
            f = codecs.open(n, "w", "utf-8")
            h = driver.page_source
            f.write(h)
    return 1
# calling a function to get the code running
k = 0
while True:
    if k % 5 == 0:
        proxy_list = get_proxies()
    k += 1
    driver = get_best_driver(driver_path = driver_path, proxy_list = proxy_list) # find the working driver
    if driver is None:
        continue
    session_result = data_fill(id_list = id_list, driver=driver)
    if session_result == 1: # data is collected
        print("Data collected.")
I have tried several ways of telling the driver to close, but none of them worked. I used to work in R and only recently switched to Python, so it may just be my lack of experience.
Thanks to Nikhil Devadiga for his ideas. In the end I found an answer myself. Here it is:
k = 0
while True:
    if k % 5 == 0:
        proxy_list = get_proxies()
    k += 1
    driver = get_best_driver(driver_path = driver_path, proxy_list = proxy_list)
    for id in id_list:
        session_result = data_fill(id_list = id, driver=driver)
        if session_result == 0:
            driver.close()
            break
    continue
print('done')
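One caveat about this loop: as written, it starts again from the first id after every proxy change and never leaves `while True`. Below is a minimal sketch of a variant that remembers which ids are still pending and stops once none are left. It is an illustration under assumptions, not a tested replacement: `download_all`, `make_driver`, `fetch` and `max_sessions` are hypothetical names that are not in my original code. `make_driver` stands in for a call to `get_best_driver(...)` (it may return `None`), and `fetch` stands in for a per-id function that returns 1 on success and 0 when the page is blocked.

```python
from collections import deque
from typing import Any, Callable, Iterable, List, Optional


def download_all(
    ids: Iterable[str],
    make_driver: Callable[[], Optional[Any]],  # hypothetical: wraps get_best_driver
    fetch: Callable[[str, Any], int],          # hypothetical: wraps the per-id download
    max_sessions: int = 50,                    # assumed cap so the loop always ends
) -> List[str]:
    """Fetch every id, switching to a new driver whenever fetch() returns 0.

    Returns the ids that are still pending when max_sessions is used up.
    """
    pending = deque(ids)
    for _ in range(max_sessions):
        if not pending:
            break  # everything has been saved
        driver = make_driver()
        if driver is None:
            continue  # no working proxy this round, try again
        try:
            while pending:
                if fetch(pending[0], driver) == 0:
                    break  # blocked: keep the id pending and change proxy
                pending.popleft()  # saved: this id is not requested again
        finally:
            driver.quit()  # quit() ends the whole browser session
    return list(pending)
```

With the functions from this post, the call could look like `download_all(id_list, lambda: get_best_driver(driver_path=driver_path, proxy_list=proxy_list), lambda i, d: data_fill(id_list=i, driver=d))`. Refreshing the proxy list every few rounds would then go inside `make_driver`.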
Before that, though, I changed data_fill so that it handles a single id and returns 0 or 1 instead of closing the driver itself:
def data_fill(id_list: str, driver) -> int:
    author_page = "https://www.elibrary.ru/author_profile_new_titles.asp?id={}".format(id_list)
    driver.get(author_page)
    result = find_element(driver)
    if result == 0:
        output = 0
    else:
        n = os.path.join(f"/Users/dariagerashchenko/PycharmProjects/python_practice/hist/j_profile{id_list}.html")
        f = codecs.open(n, "w", "utf-8")
        h = driver.page_source
        f.write(h)
        output = 1
    return output
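One more detail in data_fill: the file opened with `codecs.open` is never closed, so with thousands of pages the handles pile up and the last writes may not be flushed. A small sketch of how the saving step could be pulled out into a helper is below. `save_page` and its parameters are hypothetical names I am introducing for illustration. The only things it assumes are the `j_profile<id>.html` naming used above and that the encoding name is spelled with a plain ASCII hyphen (`"utf-8"`).

```python
from pathlib import Path


def save_page(html: str, out_dir: str, page_id: str) -> Path:
    """Write html to <out_dir>/j_profile<page_id>.html and return the path."""
    target = Path(out_dir) / f"j_profile{page_id}.html"
    # Create the output folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # The with-block closes the file even if write() raises.
    with open(target, "w", encoding="utf-8") as f:
        f.write(html)
    return target
```

Inside data_fill, the four lines that build the path and write the file could then become `save_page(driver.page_source, out_dir, id_list)`, where `out_dir` is the folder from the path above.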