python web-scraping scraperwiki

Scrape a Google Chart script with Scraperwiki (Python)


I'm just getting into scraping with Scraperwiki in Python. I've already figured out how to scrape tables from a page, run the scraper every month and stack each month's results on top of the previous ones. Pretty cool.

Now I want to scrape this page with information on Android versions and run the script monthly. In particular, I want the table for the version, codename, API and distribution. It's not easy.

The table doesn't seem to be in the HTML itself; it is loaded into a wrapper div. Is there any way to scrape this information? I can't find a solution.

Plan B is to scrape the visualisation. What I eventually need is the codename and the percentage, so that would be sufficient. This information can be found in the HTML, in a Google Chart script.

Google Chart API script

But I can't find this information with my 'souped' HTML. I have a public scraper over here. You can edit it to make it work.

Can anyone explain how I can approach this problem? A working scraper with comments on what's going on would be awesome.


Solution

  • This is a really difficult case because, as kisamoto mentioned, the data is inside the embedded JavaScript and not in a separate JSON file as you would expect. It is possible with BeautifulSoup, but it involves some ugly string processing:

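    import json

    # `soup` is assumed to be the BeautifulSoup object built from the page's HTML.
    # The chart script sits two siblings after the last <p style="clear:both">
    # (the sibling in between is presumably a whitespace text node).
    last_paragraph = soup.find_all('p', style='clear:both')[-1]
    script_tag = last_paragraph.next_sibling.next_sibling
    script_text = script_tag.string

    # Keep every line before the SCREEN_DATA declaration: that is VERSION_DATA.
    data_text = ''
    for line in script_text.split('\n'):
        if 'SCREEN_DATA' in line:
            break
        data_text = data_text + line

    data_text = data_text.replace('var VERSION_DATA =', '')
    # strip surrounding whitespace, then the semicolon at the end
    data_text = data_text.strip().rstrip(';')

    data = json.loads(data_text)
    data = data[0]
    print(data['data'])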
    

    Output:

    [{u'perc': u'0.1', u'api': 4, u'name': u'Donut'}, ... ]
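
    If the sibling hopping turns out to be fragile, the same idea can be sketched with a regular expression on the script text, followed by a step that reduces the result to the codename and percentage the question asks for. This is only a rough sketch: SAMPLE_SCRIPT is a made-up stand-in for the real script (only the Donut entry comes from the output above, the Eclair figure is invented), and extract_version_data and codename_percentages are hypothetical helper names. It also assumes the VERSION_DATA literal is valid JSON, as it was for json.loads above.

```python
import json
import re

# Hypothetical stand-in for the inline <script> text. Only the Donut entry
# mirrors the output shown above; the Eclair percentage is made up.
SAMPLE_SCRIPT = """
var VERSION_DATA =
[
  {
    "data": [
      {"api": 4, "name": "Donut", "perc": "0.1"},
      {"api": 7, "name": "Eclair", "perc": "1.5"}
    ]
  }
];
var SCREEN_DATA = [];
"""


def extract_version_data(script_text):
    """Return the parsed VERSION_DATA literal, or raise ValueError if absent."""
    # Non-greedy match from the assignment up to the first "]" that is
    # followed by ";", so later declarations (SCREEN_DATA) are left out.
    pattern = r'var\s+VERSION_DATA\s*=\s*(\[.*?\])\s*;'
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        raise ValueError('VERSION_DATA not found in script text')
    return json.loads(match.group(1))


def codename_percentages(version_data):
    """Map each codename to its percentage (as a float) for the first chart."""
    return {entry['name']: float(entry['perc'])
            for entry in version_data[0]['data']}


versions = extract_version_data(SAMPLE_SCRIPT)
shares = codename_percentages(versions)
print(shares)
```

    On the real page you would pass the script tag's text instead of SAMPLE_SCRIPT. The pattern would stop too early if a string inside the data ever contained "];", so treat it as a starting point rather than a robust parser.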