python, web-scraping, lxml

Scraping an embed element using lxml.html, or how to trick a website into thinking you have Flash installed


I'm attempting to scrape a website and need to get at an embed element. Because I'm fetching the page with Python and lxml.html, the website correctly concludes that I don't have Flash installed, and instead of the embed element it shows me this:

<div>
    <font>
        <u>
            <b>
                <a href="http://get.adobe.com/flashplayer/">
                ATTENTION:<br>This video will not play. You currently do not have Adobe Flash installed on this computer. Please click here to download it (it's free!)
                </a>
            </b>
        </u>
    </font>
</div>

Obviously that is a problem. Is it at all possible to trick the website into thinking I have Flash installed, even though I don't, so that I can retrieve the right element?

I hope someone can help!


Solution

  • I believe the following blog post answers your question well. The author had the same need (scraping Flash content with Python) and ran into the same problem. He realized that he just needed to instantiate a browser, even an in-memory one that never displays anything on screen, and then scrape its output. I think this approach could work for what you need, and he makes it easy to understand.

    http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/
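
Once a browser has rendered the page, the scraping step itself is small. The sketch below is not taken from the blog post. It assumes you already have the rendered markup as a string, and it uses the standard library's html.parser only so that it runs without extra dependencies. The names EmbedCollector, inspect_page, static_html and rendered_html are hypothetical, and both sample strings are made up for illustration. With lxml.html, the embed lookup would be roughly lxml.html.fromstring(html).xpath('//embed').

```python
from html.parser import HTMLParser

# Substring of the download link used in the "no Flash" fallback from the question.
FLASH_DOWNLOAD_URL = "get.adobe.com/flashplayer"


class EmbedCollector(HTMLParser):
    """Collects the attributes of every <embed> tag and notes the Flash fallback link."""

    def __init__(self):
        super().__init__()
        self.embeds = []           # one dict of attributes per <embed> tag
        self.saw_fallback = False  # True if the "get Flash" link is present

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "embed":
            self.embeds.append(attrs)
        elif tag == "a" and FLASH_DOWNLOAD_URL in (attrs.get("href") or ""):
            self.saw_fallback = True


def inspect_page(html):
    """Return (list of embed attribute dicts, whether the fallback link was seen)."""
    collector = EmbedCollector()
    collector.feed(html)
    collector.close()
    return collector.embeds, collector.saw_fallback


# Hypothetical inputs: what a plain HTTP fetch might return, and what a
# browser might produce for the same page after rendering it.
static_html = '<div><a href="http://get.adobe.com/flashplayer/">ATTENTION</a></div>'
rendered_html = '<div><embed src="player.swf" type="application/x-shockwave-flash"></div>'

print(inspect_page(static_html))
print(inspect_page(rendered_html))
```

Checking for the fallback link as well as the embed element makes it easy to tell whether the markup you got is the rendered page or still the "no Flash" version.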