python pandas unicode

What is this value that pandas is returning from a URL?


I'm trying to read a table from a URL with pandas, but it returns strange values in the Character column:

# Third-party dependencies: pip install pandas lxml
import pandas as pd

url = "https://docs.google.com/document/d/e/2PACX-1vRMx5YQlZNa3ra8dYYxmv-QIQ3YJe8tbI3kqcuC7lQiZm-CSEznKfN_HYNSpoXcZIV3Y_O3YoUB1ecq/pub"
c = pd.read_html(url)
print(c)

Output:

[              0          1             2
 0  x-coordinate  Character  y-coordinate
 1             0        â             0
 2             0        â             1
 3             0        â             2
 4             1        â             1
 5             1        â             2
 6             2        â             1
 7             2        â             2
 8             3        â             2]

When I print the specific characters I get this:

>>> c[0][1][1]
'â\x96\x88'

At first I assumed this was the hex code of the character, but when I checked, it wasn't. I'm not sure what the significance of the â character is.


Solution

  • The value is mojibake. The table contains █ (U+2588, FULL BLOCK), whose UTF-8 encoding is the three bytes E2 96 88. Those bytes were decoded as Latin-1, one character per byte, which gives 'â\x96\x88'. You can specify the encoding parameter in read_html() so that the page is decoded as UTF-8:

    url = "https://docs.google.com/document/d/e/2PACX-1vRMx5YQlZNa3ra8dYYxmv-QIQ3YJe8tbI3kqcuC7lQiZm-CSEznKfN_HYNSpoXcZIV3Y_O3YoUB1ecq/pub"
    c = pd.read_html(url, encoding='utf-8')

    Passing encoding='latin1' should not help here, because decoding the same bytes as Latin-1 is what produces 'â\x96\x88' in the first place.
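
To see what 'â\x96\x88' actually is, you can reverse the mis-decoding in plain Python. This is only a sketch: the helper name repair_mojibake is made up for illustration and is not part of pandas, and it assumes the text is UTF-8 that was read as Latin-1, which matches the value shown in the question.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up (hypothetical helper, not a pandas API)."""
    # Latin-1 maps code points 0-255 to single bytes, so this recovers the raw bytes.
    raw = text.encode("latin-1")
    # Decoding those bytes as UTF-8 gives the character that was intended.
    return raw.decode("utf-8")


garbled = "â\x96\x88"            # the value from c[0][1][1] in the question
raw_bytes = garbled.encode("latin-1")
fixed = repair_mojibake(garbled)

print(raw_bytes)                 # b'\xe2\x96\x88'
print(hex(ord(fixed)))           # 0x2588
print(fixed)                     # █
```

So the three "characters" are really the three UTF-8 bytes of a single █. If you are stuck with an already-parsed DataFrame, the same helper could be applied to the string cells, but asking read_html() for the right encoding up front is simpler.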