pythonunicodeurllib2scraperwiki

Unicode issue with Python scraper


I've been writing bad perl for a while, but am attempting to learn to write bad python instead. I've read around the problem I've been having for a couple of days now (and know an awful lot more about unicode as a result) but I'm still having troubles with a rogue em-dash in the following code:

import urllib2

def scrape(url):
# simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
# query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + "%22" + url + "%22"
        query_data = scrape(graph_query)
        print query_data #debug console

### START HERE ####

graph_query_root = "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="

url_list = ['http://www.supersavvyme.co.uk',  'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']

query_graph_api(url_list)

(This is a much simplified representation of the scraper, BTW. The original uses a site's sitemap.xml to build a list of URLs, then queries Facebook's Graph API for information on each -- here's the original scraper)

My attempts to debug this have consisted mostly of trying to emulate the infinite monkeys who are rewriting Shakespeare. My usual method (search StackOverflow for the error message, copy-and-paste the solution) has failed.

Question: how do I encode my data so that extended characters like the em-dash in the second URL won't break my code, but will still work in the FQL query?

P.S. I'm even wondering whether I'm asking the right question: might urllib.urlencode help me out here (certainly it would make that graph_query_root easier and prettier to create...

---8<----

The traceback I get from the actual scraper on ScraperWiki is as follows:

http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more
Line 80 - query_graph_api(urls)
Line 53 - query_data = scrape(graph_query) -- query_graph_api((urls=['http://www.supersavvyme.co.uk', 'http://...more
Line 21 - data = urllib2.urlopen(unicode(url)) -- scrape((url=u'https://graph.facebook.com/fql?q=SELECT%20url,...more
/usr/lib/python2.7/urllib2.py:126 -- urlopen((url=u'https://graph.facebook.com/fql?q=SELECT%20url,no...more
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 177: ordinal not in range(128)

Solution

  • If you are using Python 3.x, all you have to do is add one line and change another:

    gq = graph_query.encode('utf-8')
    query_data = scrape(gq)
    

    If you are using Python 2.x, first put the following line in at the top of the module file:

    # -*- coding: utf-8 -*- (read what this is for here)

    and then make all your string literals unicode and encode just before passing to urlopen:

    def scrape(url):
    # simplified
        data = urllib2.urlopen(url)
        return data.read()
    
    def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
        for url in url_list:
            graph_query = graph_query_root + u"%22" + url + u"%22"
            gq = graph_query.encode('utf-8')
            query_data = scrape(gq)
            print query_data #debug console
    
    ### START HERE ####
    
    graph_query_root = u"https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
    
    url_list = [u'http://www.supersavvyme.co.uk', u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']
    
    query_graph_api(url_list)
    

    It looks from the code like you are using 3.x, which is really better for dealing with stuff like this. But you still have to encode when necessary. In 2.x, the best advice is to do what 3.x does by default: use unicode throughout your code, and only encode when bytes are called for.