python unicode beautifulsoup encode

UnicodeEncodeError in BeautifulSoup webscraper


I have a Unicode encode error with the following code for a simple web scraper.

print 'JSON scraper initializing'

from bs4 import BeautifulSoup
import json
import requests
import geocoder


# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range (1,3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)

# Build response container
responseBucket = []

for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)


# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)


# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]
    eventsJSON = json.loads(script.text)
    allSanFranciscoEvents.append(eventsJSON)

with open("allSanFranciscoEvents.json", "w") as writeJSON:
    json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print ('end')

The odd thing is that sometimes this code works and doesn't give an error. It depends on the for i in range line of the code. For example, if I put in (2,4) for the range, it works fine. If I change it to (1,3), it fails with:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)

How can I fix this issue within my code? If I print allSanFranciscoEvents, all the data is there, so I believe the issue occurs in the final piece of code, with the JSON dump.


Solution

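  • eventsJSON is a parsed object (a list or dict), not a string, so you can't call eventsJSON.encode('utf-8') on it. The error comes from json.dump with ensure_ascii=False: it produces unicode text, and a file opened with "w" in Python 2.7 implicitly encodes that text as ASCII, which fails on u'\xe9'. That is also why only some page ranges fail: only some pages contain non-ASCII characters. To write the file as UTF-8 in Python 2.7, you can use codecs, or open the file in binary mode with the wb flag and encode the string yourself.

    with open("allSanFranciscoEvents.json", "wb") as writeJSON:
        jsStr = json.dumps(allSanFranciscoEvents, ensure_ascii=False)
        # encode() converts the unicode string to UTF-8 bytes for the binary file
        writeJSON.write(jsStr.encode('utf-8'))
    print ('end')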
    
    # and read it normally
    with open("allSanFranciscoEvents.json", "r") as readJson:
        data = json.load(readJson)
        print(data[0][0]["startDate"])
        # 2019-02-04
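
As an alternative to binary mode, the file can be opened with an explicit encoding. The following is a minimal sketch, not the only way to do it: it uses io.open (available in Python 2.7, and the same function as the built-in open in Python 3) rather than codecs. The helper names dump_json_utf8 and load_json_utf8, the sample venue name, and the temporary file path are assumptions made up for illustration; only the startDate value is taken from the output above.

```python
import io
import json
import os
import tempfile


def dump_json_utf8(data, path):
    """Serialize data as JSON and write it to path encoded as UTF-8."""
    # ensure_ascii=False keeps non-ASCII characters unescaped in the output
    text = json.dumps(data, ensure_ascii=False)
    # On Python 2, json.dumps can return a byte str; io.open needs unicode text
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # The explicit encoding avoids the implicit ASCII conversion that raised the error
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def load_json_utf8(path):
    """Read a UTF-8 encoded JSON file back into Python objects."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical sample standing in for allSanFranciscoEvents
sample_events = [[{"name": u"Caf\xe9 du Nord", "startDate": "2019-02-04"}]]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_events.json")

dump_json_utf8(sample_events, sample_path)
round_trip = load_json_utf8(sample_path)
print(round_trip[0][0]["startDate"])
```

With the real data, the call would presumably be dump_json_utf8(allSanFranciscoEvents, "allSanFranciscoEvents.json"). Dropping ensure_ascii=False from the original json.dump call should also avoid the error, at the cost of non-ASCII characters being stored as \uXXXX escapes in the file.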