pythonpandashtml5lib

How to correctly parse HTML to Unicode strings with pandas?


I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from HTML table using pandas(read_html) and write result to csv file

However, when I write this text to a file,all spaces in it gets written in an unexpected encoding (example \xd0\xb9\xd1\x82\xd0\xb8). to solve the problem I added a line i = i.split(" ") after, all spaces in csv file substitutes for characters, the example below:

['0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '', '', '', '', '', '', '2', '', '', '3\n0', '', '', '', '', '', '', '', 'number', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'last name', '', 'number', 'plan', 'NaN\n1', '', '', '', '', '', '', '', '', '', 'NaN', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NaN', '', '', 'not', 'NaN\n2', '', '', '', '', '53494580', '', '', '', '', '', '', '', '', '', '+', '(53)494580', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n3', '', '', '', '', '53494581', '', '', '', '', '', '', '', '', '', '+', '(53)494581', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n4', '', '', '', '']

I would like to get rid of character ( '', ) Is there a way to fix this? Any pointers would be much appreciated.

code python:

import pandas as pd
import html5lib

filename="1.csv" 
file=open(filename,"w",encoding='UTF-8', newline='\n'); 
output=csv.writer(file, dialect='excel',delimiter =' ') 

r = requests.get('http://10.45.87.12/og?sh=1&CallerName=&Sys=.79.83.86.51&')
pd.set_option('max_rows',10000) 
df = pd.read_html(r.content)

for i in df:
    i = str(i)
    i = i.strip()
    i = i.encode('UTF-8').decode('UTF-8')
    i = i.split(" ")
    output.writerow(i)
 file.close()

Solution

  • You can use the filter method to remove of empty values. you can add the below snippet after 'i = i.split(" ")'

    A = ['0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '', '', '', '', '', '', '2', '', '', '3\n0', '', '', '', '', '', '', '', 'number', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'last name', '', 'number', 'plan', 'NaN\n1', '', '', '', '', '', '', '', '', '', 'NaN', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'NaN', '', '', 'not', 'NaN\n2', '', '', '', '', '53494580', '', '', '', '', '', '', '', '', '', '+', '(53)494580', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n3', '', '', '', '', '53494581', '', '', '', '', '', '', '', '', '', '+', '(53)494581', '', '', '', '', '', '', '', '', 'NP_551', 'NaN\n4', '', '', '', '']
    print filter(None, A)
    

    Output:

    ['0', '1', '2', '3\n0', 'number', 'last name', 'number', 'plan', 'NaN\n1', 'NaN', 'NaN', 'not', 'NaN\n2', '53494580', '+', '(53)494580', 'NP_551', 'NaN\n3', '53494581', '+', '(53)494581', 'NP_551', 'NaN\n4']