python, selenium-webdriver, web-scraping, captcha

Selenium get captcha image in browser


I'm new to Selenium and to web scraping in general, and I'm having trouble with captchas.

I'm trying to follow the procedure described in this question:

Selenium downloading different captcha image than the one in browser

But it's not going well.

First Problem

My first problem is with the XPath selector. First, I tried this code:

from selenium import webdriver
import urllib.request


driver = webdriver.Chrome()
driver.get("http://sistemas.cvm.gov.br/?fundosreg")

# Change frame.
driver.switch_to.frame("Main")


# Download image/captcha.
img = driver.find_element_by_xpath(".//*img[2]")
src = img.get_attribute('src')
urllib.request.urlretrieve(src, "captcha.jpeg")

Basically, I only changed the URL. But I don't know whether the XPath is written correctly, or how I should write it. Putting [2] inside the string looks right, and that is how it was used in the question I linked, but it doesn't work when I try to replicate it with response.xpath in a scrapy shell session: response.xpath(".//img[2]") does not give me the image, and I have to write response.xpath(".//img")[2] instead.

The captcha on my page is hard to select because the corresponding img tag has no id, class, or any other attribute to match on. Also, the image is served from an .asp URL, and I don't know how to handle that.
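The difference between the two forms can be reproduced without a browser. The sketch below uses the standard library's xml.etree.ElementTree on a made-up table (the markup and file names are assumptions, not the real page). In XPath, .//img[2] means "an img that is the second img child of its own parent", counted from 1, while indexing the Python result list counts from 0 across all matches in document order. Separately, .//*img[2] as used in the Selenium code is not valid XPath syntax.

```python
import xml.etree.ElementTree as ET

# Invented markup for illustration only; the real page will differ.
html = """
<table>
  <tr><td><img src="logo.png"/></td></tr>
  <tr><td><img src="spacer.png"/><img src="captcha.asp"/></td></tr>
</table>
"""
root = ET.fromstring(html)

# Predicate inside the path: 1-based position among img siblings of one parent.
second_in_parent = [img.get("src") for img in root.findall(".//img[2]")]

# Index applied to the result list: 0-based position among all matches.
all_imgs = [img.get("src") for img in root.findall(".//img")]
second_overall = all_imgs[1]
third_overall = all_imgs[2]

print(second_in_parent, second_overall, third_overall)
```

In full XPath 1.0 engines, such as the ones behind Selenium and Scrapy, (.//img)[2] with parentheses selects the second img overall. ElementTree's limited XPath subset does not support that form, which is why the list index is used here.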

Second Problem

Then I tried this code, which also came up in other similar searches:

from PIL import Image
from selenium import webdriver

def get_captcha(driver, element, path):
    # now that we have the preliminary stuff out of the way time to get that image :D
    location = element.location
    size = element.size
    # saves screenshot of entire page
    driver.save_screenshot(path)

    # uses PIL library to open image in memory
    image = Image.open(path)

    left = location['x']
    top = location['y'] + 140
    right = location['x'] + size['width']
    bottom = location['y'] + size['height'] + 140

    image = image.crop((left, top, right, bottom))  # defines crop points
    image.save(path, 'png')  # saves new cropped image


driver = webdriver.Chrome()
driver.get("http://preco.anp.gov.br/include/Resumo_Por_Estado_Index.asp")

# change frame
driver.switch_to.frame("Main")

# download image/captcha
#img = driver.find_element_by_xpath(".//*[@id='trRandom3']/td[2]/img")
img = driver.find_element_by_xpath(".//*img[2]")
get_captcha(driver, img, "captcha.png")
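The crop arithmetic in get_captcha can be isolated into a small pure function, which makes the hard-coded +140 easier to reason about. This is only a sketch: crop_box, pixel_ratio, offset_x and offset_y are hypothetical names. It assumes element.location is reported in CSS pixels, that the screenshot may be scaled by the browser's device pixel ratio (which could be read with driver.execute_script("return window.devicePixelRatio")), and that a fixed offset such as 140 may be compensating for the position of a frame or header on the page.

```python
def crop_box(location, size, pixel_ratio=1.0, offset_x=0, offset_y=0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location and size are dicts shaped like Selenium's element.location
    and element.size. The offsets are added in CSS pixels before scaling.
    """
    left = (location["x"] + offset_x) * pixel_ratio
    top = (location["y"] + offset_y) * pixel_ratio
    right = left + size["width"] * pixel_ratio
    bottom = top + size["height"] * pixel_ratio
    # PIL's Image.crop expects integer pixel coordinates.
    return tuple(int(round(v)) for v in (left, top, right, bottom))


# Same numbers as the original code: no scaling, 140 px vertical offset.
box = crop_box({"x": 10, "y": 20}, {"width": 100, "height": 40}, offset_y=140)
print(box)
```

The resulting tuple can be passed directly to image.crop(box). If the cropped area is shifted or half the expected size, the offset or pixel ratio assumption is the first thing to check.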

Again, I'm having problems with xpath, but there is another problem:

Traceback (most recent call last):
  File "seletest2.py", line 27, in <module>
    driver.switch_to.frame("Main")
  File "/home/seiji/crawlers_env/lib/python3.6/site-packages/selenium/webdriver/remote/switch_to.py", line 87, in frame
    raise NoSuchFrameException(frame_reference)
selenium.common.exceptions.NoSuchFrameException: Message: Main

The problem is in this line: driver.switch_to.frame("Main"). What does it mean?

Thank you!


Solution

  • Use WebDriverWait to wait for the element, and use the expected condition frame_to_be_available_and_switch_to_it to switch to the iframe.

    Try the code below:

    driver.get("http://sistemas.cvm.gov.br/?fundosreg")
    WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.NAME, 'Main')))
    img = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#Table1 img')))
    src = img.get_attribute('src')
    urllib.request.urlretrieve(src, "captcha.jpeg")
    

    You need the following imports:

    import urllib.request

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    

    Your other URL, http://preco.anp.gov.br/include/Resumo_Por_Estado_Index.asp, is different: the captcha element there is not inside an iframe, which is why driver.switch_to.frame("Main") raises NoSuchFrameException. Use this selector for it:

    By.CSS_SELECTOR : table img
    

    Apply it with the code above, skipping the frame switch.
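One caveat about downloading the src with urllib: it sends a new request outside the browser session, so a server-generated captcha may come back as a different image than the one displayed, which is the issue named in the linked question's title. A possible alternative, sketched below, is to save the pixels the browser actually rendered. The helper name save_element_png is hypothetical; it relies only on the screenshot_as_png attribute that Selenium's WebElement provides.

```python
from pathlib import Path


def save_element_png(element, path):
    """Write the element's rendered pixels to `path` and return the Path.

    `element` is assumed to be a Selenium WebElement, or any object that
    exposes a `screenshot_as_png` attribute holding PNG bytes.
    """
    target = Path(path)
    target.write_bytes(element.screenshot_as_png)
    return target


# Hypothetical usage with the wait shown in the answer:
# img = WebDriverWait(driver, 20).until(
#     EC.presence_of_element_located((By.CSS_SELECTOR, "table img")))
# save_element_png(img, "captcha.png")
```

Because the image comes from the element itself, no manual cropping of a full-page screenshot is needed.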