I have a Unicode encode error with the following code for a simple web scraper.
print 'JSON scraper initializing'
from bs4 import BeautifulSoup
import json
import requests
import geocoder
# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range(1, 3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)
# Build response container
responseBucket = []
for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)
# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)
# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]
    eventsJSON = json.loads(script.text)
    allSanFranciscoEvents.append(eventsJSON)
with open("allSanFranciscoEvents.json", "w") as writeJSON:
    json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print('end')
The odd thing is that this code sometimes works without raising an error. It depends on the for i in range line. For example, if I use (2,4) for the range, it works fine. If I change it to (1,3), it fails with:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)
How can I fix this issue within my code? If I print allSanFranciscoEvents, it shows that all the data has been read in, so I believe the issue is happening in the final piece of code, with the JSON dump.
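eventsJSON is a parsed Python object, not a string, so you can't call eventsJSON.encode('utf-8') on it. The error comes from the final step: with ensure_ascii=False, json.dump writes unicode strings to the file, and a Python 2.7 file opened with "w" implicitly encodes them as ASCII, which fails on characters such as u'\xe9'. It presumably works for range (2,4) because those pages happen to contain only ASCII characters. To write the file in Python 2.7 you can use codecs to encode it as UTF-8, or open the file in binary mode with the wb flag: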
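with open("allSanFranciscoEvents.json", "wb") as writeJSON:
    # With the default ensure_ascii=True, json.dumps escapes non-ASCII
    # characters (e.g. \u00e9), so jsStr is a plain ASCII byte string
    # that can be written to a binary file as is; no decode() is needed
    jsStr = json.dumps(allSanFranciscoEvents)
    writeJSON.write(jsStr)
print('end')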
# and read it normally
with open("allSanFranciscoEvents.json", "r") as readJson:
    data = json.load(readJson)
print(data[0][0]["startDate"])
# 2019-02-04
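If you would rather keep characters such as é readable in the file instead of escaped, the codecs-style route mentioned above could look roughly like this. This is only a sketch: it uses io.open (which accepts an encoding argument on Python 2.7 as well as Python 3) in place of codecs, the helper names dump_utf8 and load_utf8 are made up for illustration, and sample_events is hypothetical data standing in for allSanFranciscoEvents.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile

# Hypothetical sample standing in for allSanFranciscoEvents
sample_events = [[{"name": u"Caf\xe9 Example", "startDate": "2019-02-04"}]]


def dump_utf8(obj, path):
    """Write obj as JSON to path, encoded as UTF-8, keeping non-ASCII characters."""
    text = json.dumps(obj, ensure_ascii=False)
    # On Python 2, json.dumps can return a byte string when the data is pure
    # ASCII; io.open in text mode only accepts unicode, so decode in that case.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_utf8(path):
    """Read a UTF-8 encoded JSON file back into Python objects."""
    with io.open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Write to a temporary location so the sketch does not touch real output files
out_path = os.path.join(tempfile.gettempdir(), "allSanFranciscoEvents_sample.json")
dump_utf8(sample_events, out_path)
data = load_utf8(out_path)
print(data[0][0]["startDate"])
```

The trade-off compared with the wb version is that the file contains the actual UTF-8 bytes for é rather than a \u00e9 escape; both files load back to the same data.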