
Scraping from JSON using BeautifulSoup and urllib


I'm learning web scraping on a sample website that embeds JSON in its pages. Take, for instance, the following page: http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. Its source can be viewed at view-source:https://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. I would like to get the information at lines 388-396 of the source:

<script>
    var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","fixed":0,"price":1000,"tot_price":1000,"min_bid_value":1010,"currency":"EUR","raise_bid":10,"stamp_end":"2013-06-14 12:00:00","bids_number":12,"estimated_value":200,"extended_time":0,"url":"https:\/\/www.charitystars.com\/product\/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini","conversion_value":1,"eid":0,"user_has_bidded":false},"bid":{"id":323,"uid":126,"first_name":"Fabio","last_name":"Gastaldi","company_name":"","is_company":0,"title":"fab1","nationality":"IT","amount":1000,"max_amount":0,"table":"","stamp":1371166006,"real_stamp":"2013-06-14 01:26:46"}};
    var p_currency = '€';
    var conversion_value = '1';
    var merch_items = [];
    var gallery_items = [];

    var inside_gala = false;
</script>

and save the value of each quoted key (i.e., "id", "item_number", "type", ...) in a variable with the same name.

So far I have managed to run the following:

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

hdr = {"User-Agent": "My Agent"}

# Build the request with a custom User-Agent header
req = Request("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)
response = urlopen(req)

htmlSource = response.read()
soup = BeautifulSoup(htmlSource, "html.parser")

title = soup.find_all("span", {"itemprop": "name"})  # get the title

script_soup = soup.find_all("script")  # every <script> tag on the page

For some reason, script_soup contains a lot of information that I don't need. I believe the part I need is in script_soup[9], but I don't know how to access it efficiently. I would really appreciate some help.


Solution

  • The data is indeed in script_soup[9]. The issue is that it is a JSON string hardcoded inside a script tag, where it is assigned to a JavaScript variable. You can get the script's text with script_soup[9].string, then extract the JSON string with split() (as in my example) or with a regex, and finally load it into a Python dictionary with json.loads().

    import json

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    hdr = {"User-Agent": "My Agent"}
    response = requests.get("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)

    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(response.content, "html.parser")
    script_soup = soup.find_all("script")
    # Keep the text between the first '= ' and the next ';', i.e. the js_data object
    data = json.loads(script_soup[9].string.split('= ')[1].split(';')[0])
    

    The data is now stored in the variable data as a nested dictionary. You can access individual fields directly, e.g. data["product"]["id"], or load it into pandas with pd.DataFrame(data).
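
The split() call above is short, but it would cut the JSON off if any value ever contained '= ' or ';', and it relies on the script being at index 9. As a hedged alternative, here is a minimal sketch of the regex route. The helper name extract_js_object and the shortened sample string are made up for illustration (the values are copied from the question); only the standard library is used, so the same function can be fed response.text or the .string of a script tag.

```python
import json
import re


def extract_js_object(html, var_name="js_data"):
    """Return the object assigned in `var <var_name> = {...}` as a dict."""
    # Find the assignment; the lookahead leaves match.end() on the opening brace.
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(?=\{)"
    match = re.search(pattern, html)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the ';' and the other var statements).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


# Shortened stand-in for the page's <script> block, for illustration only.
sample = """
<script>
    var js_data = {"first_time_bid":true,"product":{"id":55,"item_number":"P55","currency":"EUR"},"bid":{"id":323,"first_name":"Fabio"}};
    var p_currency = '€';
</script>
"""

data = extract_js_object(sample)
product = data["product"]
print(product["id"], product["item_number"], data["bid"]["first_name"])
# prints: 55 P55 Fabio
```

Regarding saving each key in a variable of the same name: keeping the dictionary and reading data["product"]["id"], data["bid"]["amount"] and so on is usually easier to maintain than creating one variable per key, especially since "id" appears under both "product" and "bid".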