I'm learning web scraping on a sample website whose pages embed JSON data. For instance, take the following page: http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. The source code is at view-source:https://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini. I would like to get the information at lines 388-396 of the source:
<script>
var js_data = {"first_time_bid":true,"yourbid":0,"product":{"id":55,"item_number":"P55","type":"PRODUCT","fixed":0,"price":1000,"tot_price":1000,"min_bid_value":1010,"currency":"EUR","raise_bid":10,"stamp_end":"2013-06-14 12:00:00","bids_number":12,"estimated_value":200,"extended_time":0,"url":"https:\/\/www.charitystars.com\/product\/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini","conversion_value":1,"eid":0,"user_has_bidded":false},"bid":{"id":323,"uid":126,"first_name":"Fabio","last_name":"Gastaldi","company_name":"","is_company":0,"title":"fab1","nationality":"IT","amount":1000,"max_amount":0,"table":"","stamp":1371166006,"real_stamp":"2013-06-14 01:26:46"}};
var p_currency = '€';
var conversion_value = '1';
var merch_items = [];
var gallery_items = [];
var inside_gala = false;
</script>
and save the value of each quoted key (e.g., "id", "item_number", "type", ...) in a variable with the same name.
So far I have managed to run the following:
import requests
from bs4 import BeautifulSoup
import re
import json
import time
import csv
from pandas import DataFrame
import urllib2
hdr = {"User-Agent": "My Agent"}
req = urllib2.Request("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)
response = urllib2.urlopen(req)
htmlSource = response.read()
soup = BeautifulSoup(htmlSource)
title = soup.find_all("span", {"itemprop": "name"}) # get the title
script_soup = soup.find_all("script")
For some reason, script_soup contains a lot of information that I don't need. I believe the part I need is in script_soup[9], but I don't know how to access it efficiently. I would really appreciate some help.
The data is indeed in script_soup[9]. The issue is that it is a JSON string hardcoded in a script tag. You can get the tag's content as plain text with script_soup[9].string, then extract the JSON string with split() (as in my example) or with a regex, and finally load it as a Python dictionary with json.loads().
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
import json
hdr = {"User-Agent": "My Agent"}
response = requests.get("http://www.charitystars.com/product/juve-chelsea-3-0-champions-league-jersey-autographed-by-giorgio-chiellini", headers=hdr)
soup = BeautifulSoup(response.content)
script_soup = soup.find_all("script")
data = json.loads(script_soup[9].string.split('= ')[1].split(';')[0])
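Note that the split('= ') approach assumes that neither '= ' nor ';' occurs inside the JSON text, and the index 9 assumes the page layout does not change. As a more tolerant sketch (an alternative, not the method above), one could locate the var js_data assignment with a regex and let json.JSONDecoder.raw_decode read exactly one JSON value starting at that position. The helper name extract_js_object and the sample_html string are made up for illustration; on the real page you would presumably pass response.text instead.

```python
import json
import re


def extract_js_object(html, var_name="js_data"):
    """Return the JSON value assigned to `var <var_name> = ...` in html."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and later statements).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


# Hypothetical, shortened stand-in for the page source.
sample_html = """<script>
var js_data = {"yourbid":0,"product":{"id":55,"currency":"EUR","url":"https:\\/\\/example.com\\/a;b"},"bid":{"id":323,"amount":1000}};
var p_currency = 'EUR';
</script>"""

data = extract_js_object(sample_html)
print(data["product"]["id"])
print(data["product"]["url"])
```

Because the decoder stops at the end of the JSON value, a ';' inside a string (as in the sample URL) does not truncate the result.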
The data is now stored in the variable data. You can parse it as you like or load it into pandas with DataFrame(data).
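As for the "one variable per key" part of the question: rather than creating a separate variable for each key, it is usually simpler to read values from the dictionary by key. The sketch below uses a trimmed-down copy of the data from the question and a hypothetical helper, flatten, that turns nested keys into names such as product_id. That naming scheme is an assumption for illustration, not something the site defines.

```python
import json

# Trimmed-down copy of the JSON shown in the question.
raw = ('{"first_time_bid":true,"yourbid":0,'
       '"product":{"id":55,"item_number":"P55","price":1000,"currency":"EUR"},'
       '"bid":{"id":323,"first_name":"Fabio","amount":1000}}')
data = json.loads(raw)

# Nested values are reached by chaining keys.
print(data["product"]["item_number"])
print(data["bid"]["first_name"])


def flatten(d, prefix=""):
    """Flatten nested dicts into one level, joining key names with '_'."""
    flat = {}
    for key, value in d.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "_"))
        else:
            flat[name] = value
    return flat


flat = flatten(data)
print(sorted(flat))
```

A flat dictionary like this can also be handed to pandas as a single row, for example with DataFrame([flat]).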