I'm just getting into scraping with Scraperwiki in Python. Already figured out how to scrape tables from a page, run the scraper every month and save the results on top of each other. Pretty cool.
Now I want to scrape this page with information on Android versions and run the script monthly. In particular, I want the table for the version, codename, API and distribution. It's not easy.
The table is loaded into a wrapper div, so I can't get at it in the page's HTML. Is there any way to scrape this information? I can't find a solution.
Plan B is to scrape the visualisation. All I ultimately need is the codename and the percentage, so that would be sufficient. This information can be found in the HTML, in a Google Chart script.
But I can't find this information with my 'souped' HTML. I have a public scraper over here. You can edit it to make it work.
Can anyone explain how I can approach this problem? A working scraper with comments on what's going on would be awesome.
This is really a difficult case because, as kisamoto mentioned, the data is inside the embedded JavaScript and not in a separate JSON file as you would expect. It is possible with BeautifulSoup, but it involves some ugly string processing:
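import json

# `soup` is the BeautifulSoup object built from the downloaded page.
# The inline script sits two siblings after the last <p style="clear:both">.
last_paragraph = soup.find_all('p', style='clear:both')[-1]
script_tag = last_paragraph.next_sibling.next_sibling
script_text = script_tag.text
lines = script_text.split('\n')
data_text = ''
# collect every line up to the SCREEN_DATA declaration
for line in lines:
    if 'SCREEN_DATA' in line: break
    data_text = data_text + line
data_text = data_text.replace('var VERSION_DATA =', '')
# delete the semicolon (and any stray whitespace) at the end
data_text = data_text.strip().rstrip(';')
data = json.loads(data_text)
data = data[0]
print(data['data'])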
Output:
[{u'perc': u'0.1', u'api': 4, u'name': u'Donut'}, ... ]
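If the sibling navigation and line-by-line slicing turn out to be fragile, a regular expression over the script text is one possible alternative. The following is only a sketch under assumptions: `SAMPLE_SCRIPT` is a hand-written stand-in for the real script text (the "Placeholder" entry is made up), `extract_js_array` is a hypothetical helper name, and it assumes the array literal is valid JSON and that no `];` occurs inside a string value.

```python
import json
import re

# Hand-written stand-in for the inline script text; the real page may differ.
# The "Placeholder" entry is invented purely for illustration.
SAMPLE_SCRIPT = """
var VERSION_DATA =
[
  {
    "data": [
      {"api": 4, "name": "Donut", "perc": "0.1"},
      {"api": 0, "name": "Placeholder", "perc": "0.0"}
    ]
  }
];

var SCREEN_DATA =
[
  {"data": {}}
];
"""


def extract_js_array(script_text, var_name):
    """Return the array assigned to `var_name`, parsed as JSON (hypothetical helper)."""
    # Non-greedy match from the first '[' after the assignment
    # up to the first ']' that is followed by a semicolon.
    pattern = r'var\s+' + re.escape(var_name) + r'\s*=\s*(\[.*?\])\s*;'
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        raise ValueError('variable %s not found' % var_name)
    return json.loads(match.group(1))


versions = extract_js_array(SAMPLE_SCRIPT, 'VERSION_DATA')[0]['data']
# Map each codename to its distribution percentage as a float.
shares = {entry['name']: float(entry['perc']) for entry in versions}
print(shares)
```

With the real page you would pass `script_tag.text` instead of `SAMPLE_SCRIPT`. The resulting `shares` dict holds just the codename and percentage, which is what the question asks for.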