ScraperWiki character encoding anomaly


Here is a ScraperWiki scraper written in Python:

import lxml.html
import scraperwiki
from unidecode import unidecode

html = scraperwiki.scrape("http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/range/001-200")
root = lxml.html.fromstring(html)
for tr in root.cssselect("table.ranking tr"):
    if len(tr.cssselect("td.rank")) > 0 and len(tr.cssselect("td.uni")) > 0:
        university = unidecode(tr.cssselect("td.uni")[0].text_content()).strip().title()
        if 'cole' in university:
            print university

It produces the following output:

Ecole Polytechnique Federale De Lausanne
Ecole Normale Superieure
Acole Polytechnique
Ecole Normale Superieure De Lyon

My question: what is causing the initial character on the third output line to be rendered as "A" rather than as "E", and how can I stop this from happening?


Solution

  • Based on soulseekah's helpful comment and on the lxml docs, the following solution works. It decodes the scraped page to Unicode with BeautifulSoup's UnicodeDammit before handing it to lxml:

    import lxml.html
    import scraperwiki
    from unidecode import unidecode
    from BeautifulSoup import UnicodeDammit
    
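    def decode_html(html_string):
        # Let UnicodeDammit detect the page's encoding and decode it to unicode.
        converted = UnicodeDammit(html_string, isHTML=True)
        if not converted.unicode:
            # UnicodeDecodeError needs five constructor arguments, so raise
            # a plain ValueError with a formatted message instead.
            raise ValueError(
                "Failed to detect encoding, tried [%s]"
                % ', '.join(converted.triedEncodings))
        return converted.unicode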
    
    html = scraperwiki.scrape("http://www.timeshighereducation.co.uk/world-university-rankings/2012-13/world-ranking/range/001-200")
    root = lxml.html.fromstring(decode_html(html))
    for tr in root.cssselect("table.ranking tr"):
        if len(tr.cssselect("td.rank")) > 0 and len(tr.cssselect("td.uni")) > 0:
            university = unidecode(tr.cssselect("td.uni")[0].text_content()).strip().title()
            if 'cole' in university:
                print university
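
As for why the "A" appears: a likely explanation (my assumption, not confirmed in the thread) is that the page is served as UTF-8, but the raw bytes get decoded as Latin-1 when lxml receives an undecoded byte string. The UTF-8 encoding of "É" is the two bytes C3 89; read as Latin-1 these become "Ã" plus a control character, and transliteration then turns "Ã" into "A". The sketch below reproduces the symptom with the standard library only. `to_ascii` is a hypothetical, rough stand-in for unidecode, and the sample bytes are made up for illustration rather than taken from the real page.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(text):
    """Rough stand-in for unidecode (assumption): split accented letters
    into base letter + combining mark, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# UTF-8 bytes for "École Polytechnique" (illustrative sample, not scraped).
raw = b"\xc3\x89cole Polytechnique"

wrong = raw.decode("latin-1")  # mis-decoded: "Ã" + U+0089 + "cole ..."
right = raw.decode("utf-8")    # correctly decoded: "École ..."

print(to_ascii(wrong))  # Acole Polytechnique
print(to_ascii(right))  # Ecole Polytechnique
```

If that is indeed the cause, any approach that decodes the bytes with the correct encoding before parsing should avoid the problem; UnicodeDammit simply automates the detection step.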