python selenium-webdriver screen-scraping

Scraping presidential speeches with Selenium


I have been scraping with BS4 but have never used Selenium before. Now I think I need to. I created a list of links to old presidential speeches on the government's website - this is an example. (That's public information by legal definition, so it should not be blocked like this.)

BS4 gives me this output:

<noscript>Please enable JavaScript to view the page content.
<br/>Your support ID is: XXXXXXXXXXXX.</noscript>

I know I should use Selenium, but I have been failing even to configure the driver.

import time  # needed for time.sleep() below

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager  # imported but never used

driver = webdriver.Chrome()
driver.get('http://www.biblioteca.presidencia.gov.br/presidencia/ex-presidentes/michel-temer/discursos-do-presidente-da-republica/discurso-do-presidente-da-republica-michel-temer-durante-sessao-solene-de-abertura-xxi-marcha-a-brasilia-em-defesa-dos-municipios-palacio-do-planalto')
time.sleep(5)  # fixed pause; does not guarantee the JavaScript has finished
driver.quit()

This is the output I get:

SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)

Can you help me get it right?

SOLVED: JeffC found the problem. I wasn't giving the page enough time to load its JavaScript.

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

url = 'http://www.biblioteca.presidencia.gov.br/presidencia/ex-presidentes/michel-temer/discursos-do-presidente-da-republica/discurso-do-presidente-da-republica-michel-temer-durante-sessao-solene-de-abertura-xxi-marcha-a-brasilia-em-defesa-dos-municipios-palacio-do-planalto'
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 10)
title = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1"))).text
print(title)
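Since the goal is a whole list of links, the same explicit wait can be wrapped in a loop. This is only a minimal sketch: the helper names `make_title_fetcher`, `scrape_titles`, and `main` are made up for illustration, the list holds just the one example URL from above, and it assumes every speech page puts its title in an `h1`.

```python
from typing import Callable, Dict, Iterable, Optional


def make_title_fetcher(driver, timeout: float = 10) -> Callable[[str], Optional[str]]:
    """Return a function that loads a URL and returns its <h1> text, or None on timeout."""
    # Selenium is imported here so scrape_titles() can be used and tested without it.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    wait = WebDriverWait(driver, timeout)

    def fetch(url: str) -> Optional[str]:
        driver.get(url)
        try:
            # Poll until the heading is visible instead of sleeping a fixed time.
            heading = wait.until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, "h1"))
            )
        except TimeoutException:
            return None  # the page never showed an <h1> within the timeout
        return heading.text

    return fetch


def scrape_titles(
    urls: Iterable[str], fetch_title: Callable[[str], Optional[str]]
) -> Dict[str, Optional[str]]:
    """Map each URL to whatever fetch_title returns for it, in input order."""
    return {url: fetch_title(url) for url in urls}


def main() -> None:
    from selenium import webdriver

    # Hypothetical list; only the example URL from the question is known.
    speech_urls = [
        "http://www.biblioteca.presidencia.gov.br/presidencia/ex-presidentes/michel-temer/discursos-do-presidente-da-republica/discurso-do-presidente-da-republica-michel-temer-durante-sessao-solene-de-abertura-xxi-marcha-a-brasilia-em-defesa-dos-municipios-palacio-do-planalto",
    ]
    driver = webdriver.Chrome()
    try:
        titles = scrape_titles(speech_urls, make_title_fetcher(driver))
    finally:
        driver.quit()  # close the browser even if a page fails

    for url, title in titles.items():
        print(title if title is not None else f"timed out: {url}")
```

Calling `main()` opens one browser, reuses it for every link, and closes it at the end. Keeping the loop in `scrape_titles` separate from the Selenium part makes it easy to try the loop with a stand-in fetcher first.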

Solution

  • The code below should work. Make sure you have an up-to-date Selenium package (4.6 or later). You no longer need an external driver manager such as webdriver_manager because Selenium now ships with its own, Selenium Manager.

    from selenium import webdriver
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    url = 'http://www.biblioteca.presidencia.gov.br/presidencia/ex-presidentes/michel-temer/discursos-do-presidente-da-republica/discurso-do-presidente-da-republica-michel-temer-durante-sessao-solene-de-abertura-xxi-marcha-a-brasilia-em-defesa-dos-municipios-palacio-do-planalto'
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(url)
    
    # Poll for up to 10 seconds until the <h1> is visible, rather than sleeping a fixed time
    wait = WebDriverWait(driver, 10)
    title = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "h1"))).text
    print(title)
    

    This prints

    22-05-2018-Discurso do Presidente da República, Michel Temer, durante Sessão Solene de Abertura XXI Marcha a Brasília em Defesa dos Municípios - Palácio do Planalto

    If this doesn't work, something is off in your setup, and you probably need to check it or set it up again.
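One setup problem that commonly produces "DevToolsActivePort file doesn't exist" is Chrome being launched somewhere it cannot open a window or use its sandbox, such as a server, a container, or a hosted notebook. The question does not say where the script ran, so treat this as an assumption and the code as a sketch of one thing to try, not a guaranteed fix. The helper names `chrome_arguments` and `make_driver` are made up for illustration; the switches themselves are standard Chrome command-line flags.

```python
from typing import List


def chrome_arguments(headless: bool = True) -> List[str]:
    """Chrome switches often suggested when Chrome exits before a session is created."""
    args = [
        "--no-sandbox",             # the sandbox can refuse to start as root or in containers
        "--disable-dev-shm-usage",  # avoid the small /dev/shm found in many containers
    ]
    if headless:
        # Run without a visible window (no display needed).
        args.insert(0, "--headless=new")
    return args


def make_driver(headless: bool = True):
    """Start Chrome with the switches above (assumes Selenium 4.6+ and Chrome are installed)."""
    # Imported here so chrome_arguments() can be inspected without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

With this in place, `driver = make_driver()` would replace `driver = webdriver.Chrome()` in the code above; on a normal desktop, `make_driver(headless=False)` keeps the visible window.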