Tags: python, parsing, selenium, urllib2, scraperwiki

How to get Selenium to work on ScraperWiki


I love Selenium and I love ScraperWiki, but somehow I cannot get them to work properly together. I've tried two ways of opening a website with Selenium on ScraperWiki, both taken from tutorials:

import selenium
sel = selenium.selenium("localhost",4444,"*firefox", "http://www.google.com")   
sel.open("http://google.com")

This does not work. It gives me the following error:

error: [Errno 111] Connection refused 

And neither does this:

from selenium import webdriver 
browser = webdriver.Firefox()

Which gives another error:

/usr/lib/python2.7/subprocess.py:672 -- __init__((self=<subprocess.Popen object at 0x1d14410>, args=[None, '-silent'], bufsize=0, executable=None, stdin=None, stdout=-1, stderr=-1, preexec_fn=None, close_fds=False, shell=False, cwd=None, env=None, universal_newlines=False, startupinfo=None, creationflags=0))
AttributeError: 'NoneType' object has no attribute 'rfind'

Does anyone see a logical reason for this?

The docs on ScraperWiki say that Selenium is "Only useful in ScraperWiki if you have a Selenium server to point it to." I don't know exactly what they mean by this, but I reckon it might be the cause of the problem. Any help would be greatly appreciated.


Solution

  • Selenium isn't just the Python library you are using. It also includes a separate piece of software, the Selenium server mentioned in your question, which interacts with the browser on your code's behalf.

    This line of code

    sel = selenium.selenium("localhost",4444,"*firefox", "http://www.google.com")  
    

    is trying to connect to the Selenium server so that it can send commands to the browser (Firefox) as your code asks it to. In the context of your script, "localhost" is one of ScraperWiki's servers, which is not running the Selenium server. That is why the connection is refused.

    What you would need to do is download http://selenium.googlecode.com/files/selenium-server-standalone-2.28.0.jar and install it on a different server, running it with

    java -jar selenium-server-standalone-2.28.0.jar
    

    Then you could change your code to point to the server where it is running. It gets more complicated because ScraperWiki restricts which ports you can make outbound connections on, so instead of port 4444 you might need to use another one (probably port 80).

    All told, I think it probably isn't a workable solution; you might be better off just writing your scraper in plain Python, PHP or Ruby.
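
If you do want to try the remote-server route anyway, the following is a minimal sketch of what "pointing your code at the server" could look like. It first checks whether anything is listening on the chosen host and port (the check that fails with "Connection refused" when you use localhost), and only then opens a session with the same `selenium.selenium(...)` call from the question. The function names `selenium_server_reachable` and `open_remote_session` are made up for illustration, and the host and port you pass in are assumptions that depend on where you run the jar. Note that the check only proves the port is open, not that a Selenium server is what answers on it.

```python
import socket


def selenium_server_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only shows that something is listening there,
    not that the listener is actually a Selenium server.
    """
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # "[Errno 111] Connection refused" and timeouts both end up here.
        return False
    conn.close()
    return True


def open_remote_session(host, port, start_url):
    """Open a Firefox session through a Selenium server on another machine.

    Uses the same Selenium RC client call as the question, with the remote
    host and port in place of "localhost" and 4444.
    """
    # Imported here so the reachability check works without the library.
    import selenium

    sel = selenium.selenium(host, port, "*firefox", start_url)
    sel.start()          # ask the server to launch the browser
    sel.open(start_url)  # then load the page
    return sel
```

With a server of your own (say the hypothetical `selenium.example.com`, listening on port 80), you would call `selenium_server_reachable("selenium.example.com", 80)` first and only call `open_remote_session("selenium.example.com", 80, "http://www.google.com")` if it returns `True`.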