I am scraping forum post titles using Firefox and geckodriver with Selenium in Python and have hit a snag that I can't seem to figure out.
~$ geckodriver --version
geckodriver 0.19.0
The source code of this program is available from
testing/geckodriver in https://hg.mozilla.org/mozilla-central.
This program is subject to the terms of the Mozilla Public License 2.0.
You can obtain a copy of the license at https://mozilla.org/MPL/2.0/.
I am trying to scrape a couple of years' worth of past post titles from the forum, and my code works fine for a while. I've sat and watched it run for about 20-30 minutes, and it does exactly what it is supposed to do. However, when I kick the script off and go to bed, I wake up the next morning to find that it has crashed after processing ~22,000 posts. The site I'm currently scraping has 25 posts per page, so it got through ~880 separate URLs before it crashed.
When it does crash it throws the following error:
WebDriverException: Message: Tried to run command without establishing a connection
Initially my code looked like this:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
FirefoxProfile = webdriver.FirefoxProfile('/home/me/jupyter-notebooks/FirefoxProfile/')
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
driver.close()
I've also tried:
driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
    driver.close()
and
for url in urls:
    driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    driver.get(url)
    ### code to process page here ###
    driver.close()
I get the same error in all three scenarios, but only after it's been running successfully for quite a while, and I'm not sure how to determine why it's failing.
How do I determine why I get this error after it has successfully processed several hundred URLs? Or is there some sort of best practice I'm not following with Selenium/Firefox for processing this many pages?
All three code blocks were nearly correct but had minor flaws, as follows:
Your first code block is:
driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
driver.close()
This code block looks promising apart from one issue. In the last step, as per best practices, you should invoke driver.quit() instead of driver.close(), which prevents dangling WebDriver instances from lingering in system memory. The difference is that driver.close() closes only the browser window that currently has focus, whereas driver.quit() closes every associated window and ends the WebDriver session.
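A minimal sketch of the first pattern with that fix applied, assuming one driver is reused for every URL. The names scrape_all, make_driver, process_page and make_firefox_driver are hypothetical helpers introduced for illustration and are not part of Selenium. Wrapping the loop in try/finally is one way to make sure driver.quit() still runs if a page raises partway through.

```python
def scrape_all(urls, make_driver, process_page):
    """Reuse a single driver for every URL and always quit it at the end.

    make_driver:  zero-argument callable returning a WebDriver-like object
                  (hypothetical hook; see make_firefox_driver below).
    process_page: callable taking the driver and returning the scraped data;
                  stands in for the "code to process page here" step.
    """
    driver = make_driver()
    results = []
    try:
        for url in urls:
            driver.get(url)
            results.append(process_page(driver))
    finally:
        # quit() ends the whole session, not just the current window.
        driver.quit()
    return results


def make_firefox_driver():
    # Imported lazily so the function above can be exercised without a browser.
    from selenium import webdriver
    # Assumption: default options; pass the profile/capabilities from the
    # question here if they are needed.
    return webdriver.Firefox()
```

It would be called as scrape_all(urls, make_firefox_driver, process_page), with process_page written to extract the post titles from the loaded page.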
Your second code block is:
driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
    driver.close()
This block is error-prone. Once execution enters the for loop and finishes working on the first URL, driver.close() closes the browser session. So when the loop starts its second iteration, the script errors on driver.get(url) because there is no longer an active browser session.
Your third code block is:
for url in urls:
    driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    driver.get(url)
    ### code to process page here ###
    driver.close()
This code block looks sound apart from the same issue as the first code block. In the last step you should invoke driver.quit() instead of driver.close(), which prevents dangling WebDriver instances from lingering in system memory. As the dangling WebDriver instances accumulate and keep occupying ports, at some point WebDriver is unable to find a free port or to open a new browser session/connection. Hence you see the error WebDriverException: Message: Tried to run command without establishing a connection.
As per best practices, invoke driver.quit() instead of driver.close(), and then open a new WebDriver instance and a new web browser session.
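The advice above can be sketched as follows, assuming it is acceptable to recycle the browser every so often during a long run. scrape_in_batches, make_driver, process_page and batch_size are hypothetical names introduced for illustration, and the default of 100 pages per browser is an arbitrary choice rather than a measured value.

```python
def scrape_in_batches(urls, make_driver, process_page, batch_size=100):
    """Open a fresh driver per batch of URLs and quit it when the batch ends.

    make_driver is a zero-argument callable returning a WebDriver-like
    object, for example (assumption, mirroring the question's setup):
        lambda: webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    """
    results = []
    for start in range(0, len(urls), batch_size):
        driver = make_driver()  # new WebDriver instance and browser session
        try:
            for url in urls[start:start + batch_size]:
                driver.get(url)
                results.append(process_page(driver))
        finally:
            # quit() rather than close(), so no dangling instance is left
            # behind before the next batch opens its own browser.
            driver.quit()
    return results
```

Starting one browser per batch rather than per URL keeps the startup cost low while still bounding how long any single session lives.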