I'm new to Selenium and to web scraping in general, and now I'm having trouble with captchas.
I'm trying to follow the procedure described in this question:
Selenium downloading different captcha image than the one in browser
But it's not going well.
First Problem
My first problem is with the XPath selector. First, I tried this code:
from selenium import webdriver
import urllib.request
driver = webdriver.Chrome()
driver.get("http://sistemas.cvm.gov.br/?fundosreg")
# Change frame.
driver.switch_to.frame("Main")
# Download image/captcha.
img = driver.find_element_by_xpath(".//*img[2]")
src = img.get_attribute('src')
urllib.request.urlretrieve(src, "captcha.jpeg")
Basically, I only changed the link. But I don't know whether the XPath is written correctly, or how I should write it. Putting [2] inside the quotes seems reasonable, and it was used that way in the question I linked, but it doesn't work when I try to replicate it with response.xpath in a scrapy shell session: response.xpath(".//img[2]") does not give me the image. It has to be written this way instead: response.xpath(".//img")[2]
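For what it's worth, the two expressions are not equivalent, which would explain the difference. In XPath, .//img[2] selects every img that is the second img child of its own parent, whereas response.xpath(".//img")[2] takes the whole result list and picks index 2 (zero-based, so the third image found). The sketch below illustrates this with the standard library's xml.etree.ElementTree on a made-up table; the markup and file names are assumptions, not the real page. In full XPath 1.0, as used by Selenium and Scrapy, the equivalent of that list index would be (.//img)[3].

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: three images, each alone inside its own <td>.
DOC = """
<table>
  <tr><td><img src="logo.png"/></td></tr>
  <tr><td><img src="banner.png"/></td></tr>
  <tr><td><img src="captcha.asp"/></td></tr>
</table>
"""

root = ET.fromstring(DOC)

# Positional predicate: "an img that is the 2nd img child of its parent".
# No <td> holds two images, so nothing matches.
second_in_parent = root.findall(".//img[2]")

# Select every img first, then index the resulting Python list (zero-based).
all_images = root.findall(".//img")
third_image = all_images[2]

print(len(second_in_parent))
print(third_image.get("src"))
```

If the captcha is the only image inside its table cell, a predicate like [2] can never match it, no matter how many images the page contains.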
The captcha on my page is hard to select because the corresponding img tag doesn't have an id, a class, or anything else to match on. Also, it is in .asp format, and I don't know how to handle that.
Second Problem
Then I tried this code, which also appeared in other similar searches:
from PIL import Image
from selenium import webdriver
def get_captcha(driver, element, path):
    # now that we have the preliminary stuff out of the way time to get that image :D
    location = element.location
    size = element.size
    # saves screenshot of entire page
    driver.save_screenshot(path)
    # uses PIL library to open image in memory
    image = Image.open(path)
    left = location['x']
    top = location['y'] + 140
    right = location['x'] + size['width']
    bottom = location['y'] + size['height'] + 140
    image = image.crop((left, top, right, bottom))  # defines crop points
    image.save(path, 'png')  # saves new cropped image
driver = webdriver.Chrome()
driver.get("http://preco.anp.gov.br/include/Resumo_Por_Estado_Index.asp")
# change frame
driver.switch_to.frame("Main")
# download image/captcha
#img = driver.find_element_by_xpath(".//*[@id='trRandom3']/td[2]/img")
img = driver.find_element_by_xpath(".//*img[2]")
get_captcha(driver, img, "captcha.png")
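The crop arithmetic in get_captcha can be pulled out into a small pure function, which makes it easier to check without a browser. This is only a sketch: crop_box, y_offset and scale are hypothetical names, the 140 is the hard-coded offset from the code above, and scale is an assumption for displays where screenshot pixels may not match the CSS pixels reported by element.location.

```python
def crop_box(location, size, y_offset=0, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location and size are dicts shaped like Selenium's element.location
    and element.size. y_offset and scale are illustrative parameters.
    """
    left = location["x"] * scale
    top = (location["y"] + y_offset) * scale
    right = (location["x"] + size["width"]) * scale
    bottom = (location["y"] + size["height"] + y_offset) * scale
    # PIL's Image.crop expects a 4-tuple; round to whole pixels.
    return tuple(round(value) for value in (left, top, right, bottom))


# Made-up element geometry, with the same 140-pixel vertical offset as above.
box = crop_box({"x": 10, "y": 20}, {"width": 120, "height": 40}, y_offset=140)
print(box)
```

The returned tuple can be passed straight to image.crop(box). Keeping the offset as a parameter makes it obvious that 140 is specific to one page layout rather than a general constant.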
Again, I'm having problems with xpath, but there is another problem:
Traceback (most recent call last):
  File "seletest2.py", line 27, in <module>
    driver.switch_to.frame("Main")
  File "/home/seiji/crawlers_env/lib/python3.6/site-packages/selenium/webdriver/remote/switch_to.py", line 87, in frame
    raise NoSuchFrameException(frame_reference)
selenium.common.exceptions.NoSuchFrameException: Message: Main
The problem is in this line: driver.switch_to.frame("Main")
What does it mean?
Thank you!
Use WebDriverWait to wait for the element, and use the expected condition frame_to_be_available_and_switch_to_it to switch to the iframe.
Try the code below:
driver.get("http://sistemas.cvm.gov.br/?fundosreg")
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'Main')))
img = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#Table1 img')))
src = img.get_attribute('src')
urllib.request.urlretrieve(src, "captcha.jpeg")
You need the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
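To make concrete what "waiting" means here: an explicit wait keeps evaluating a condition until it returns something truthy or a timeout expires, retrying while the frame or element is not there yet. The following is only a conceptual sketch in plain Python, not Selenium's actual implementation; wait_until, WaitTimeout and make_flaky_condition are hypothetical names, and LookupError stands in for Selenium's "not found" exceptions.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for Selenium's TimeoutException."""


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = condition()
            if value:
                return value
        except LookupError:
            # Stand-in for "frame or element not found yet": ignore and retry.
            pass
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


def make_flaky_condition(fail_times):
    """Build a fake condition that fails fail_times times, then succeeds."""
    calls = {"n": 0}

    def condition():
        calls["n"] += 1
        if calls["n"] <= fail_times:
            raise LookupError("frame not loaded yet")
        return "Main"

    return condition


# The fake frame "appears" on the third attempt, so the wait succeeds.
frame = wait_until(make_flaky_condition(2), timeout=2.0, poll=0.01)
print(frame)
```

The point of the pattern is that the script no longer depends on the page being fully loaded at the moment the lookup runs; it only fails if the condition is still unmet when the timeout expires.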
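On your other URL, http://preco.anp.gov.br/include/Resumo_Por_Estado_Index.asp, the captcha element is not inside an iframe, so there is no frame to switch to. That is most likely why driver.switch_to.frame("Main") raises NoSuchFrameException there. Skip the frame switch and use this selector:
By.CSS_SELECTOR : table img
Then apply the same waiting approach as in the code above.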