I am trying to extract a table from a PDF with tabula. The table is identified correctly, but its data is extracted as Unicode codes rather than readable string data.
from tabula import read_pdf
df = read_pdf('https://watermark.silverchair.com/fsab153.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAs0wggLJBgkqhkiG9w0BBwagggK6MIICtgIBADCCAq8GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMOXfntjWl9L87SyaXAgEQgIICgMSxXbyEzl4Y3sDeaGncgcE9V93d46LWUAnMiKz0KtHAKJA1HpPuefZZzrhJlD_hNUzK9C4uWwF1EfAbe0aWG3c_sFLetD5kqOWXzuGARvCRWOvmAEKpgtx0Desj5MY9lH7Zp7XxbfLBLScOIK6X_qEZ3Low6GkQfm1iBCbVHUg9ueKxLaYghX--uHPqmx43RZHk8bAjoDdMDT9lPsVXqlZJkmS2UT6T3uzC1jPTz3eON93C5CaEpW4lG_zvzMMltlZZm04Zz1vWd7WsXa_Gvc1gwO1AwUNcBxrRrr7Af5U02SPMaFF8dL0cOqrpw24LPzrg8ibtBq9yKidnCM-B2z74goz41kzv2KNZoPYQLj5XYlbyTknoE-MDo6cq_tGMw7igxbsrKUbGzSGILZ-bDQAVTyGKlU1QudNbZd4lDOe36kdr6dlhWHe7aK6vQgczTOYvQ0v1G5HwouxwTO0WPVpxawld76AZLhathmV4fMmNAYFpZDOytT4YAZEj-jjkPvzJ7HeA_-7ifmtwqLiOSILbLuJgEhLQ5frm9YXSn3crSInflJEsMm6Bs8pE_5H8vdex2tXzL6ZmHiDkDMdB_YM8iOhJGdMfZWsCJ0TcrtZyWZv5t-M1NzhLutplX-mYInE1sXZSTLHcOD0YDhEeMPNJhdGvISG_IbwDfH9OKuGQ0x8UCoe2DPVKOd53PYghKf2Bk8q7tILs3WeHgItnvRbkevjYS287gh_5052TKJJbC8dYxkVlHn-JCsbaMfn_SlYSaWjOfVxvSHKsVlFj5ry-cfScH8ai1bra8LASgwg4y_vpNeeDiA0CwZaPy2l_TF1O_yFsaKItyDkCMJXqhjI', pages=3)[0]
df['Unnamed: 0']
What is the correct way to extract the data in UTF-8 or ASCII?
Edit: something on my system (Debian) is able to interpret these codes, though (see below), so the question is: how do I get this information out?
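Before deciding how to decode the data, it can help to check which code points the extracted cells actually contain. The following is only a minimal sketch: `describe_chars` and `sample_cell` are hypothetical names, and the sample string merely stands in for one entry of `df['Unnamed: 0']`. A Unicode category of `Co` means "private use", i.e. the meaning of the character is defined by the PDF's embedded font rather than by Unicode itself.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode category) for each distinct character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.category(ch))
            for ch in sorted(set(text))]


# Hypothetical cell value, standing in for one entry of df['Unnamed: 0'].
sample_cell = "\uf6dc\uf641\uf641\uf63d"

for ch, code_point, category in describe_chars(sample_cell):
    # Category 'Co' marks a private-use code point.
    print(code_point, category)
```

If every character reports `Co`, no standard encoding such as UTF-8 or ASCII will turn it into a digit on its own; some explicit mapping is needed.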
After trying various suggestions from the comments, I ended up creating a dictionary that maps the Unicode code points to the required digits. I wrote the extracted table to a CSV file and applied the map to get readable data.
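import pandas as pd

# Map the private-use code points in the extracted text to the digits they represent.
utf_map = {'\uf639': '0', '\uf6dc': '1', '\uf63a': '2', '\uf63b': '3', '\uf63c': '4',
           '\uf63d': '5', '\uf63e': '6', '\uf63f': '7', '\uf640': '8', '\uf641': '9'}

with open('cod_catch.csv', encoding='utf-8') as f:
    string = f.read()

new_string = ''
for ch in string:
    if ch == ' ':
        pass  # drop spaces
    elif ch in utf_map:
        new_string += utf_map[ch]  # replace a mapped code point with its digit
    else:
        new_string += ch  # keep everything else unchanged

with open('cod_catch_translated.csv', 'w', encoding='utf-8') as f:
    f.write(new_string)

cod_catch = pd.read_csv('cod_catch_translated.csv')
print(cod_catch)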
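The same mapping could probably also be applied in memory, without the intermediate CSV files, using Python's built-in `str.maketrans` and `str.translate`. This is a sketch rather than what I actually ran: `translate_cell` and `sample` are hypothetical names, and mapping `' '` to `None` is an assumption made to mirror the loop above, which drops spaces.

```python
# Same mapping as above, from private-use code points to digits.
utf_map = {'\uf639': '0', '\uf6dc': '1', '\uf63a': '2', '\uf63b': '3', '\uf63c': '4',
           '\uf63d': '5', '\uf63e': '6', '\uf63f': '7', '\uf640': '8', '\uf641': '9'}

# Build a translation table; mapping ' ' to None deletes spaces.
table = str.maketrans({**utf_map, ' ': None})


def translate_cell(value):
    """Translate string cells; leave non-string values (e.g. NaN) untouched."""
    return value.translate(table) if isinstance(value, str) else value


# Hypothetical cell value for illustration.
sample = "\uf6dc\uf641 \uf641\uf63d"
print(translate_cell(sample))  # 1995
```

On the DataFrame itself, something like `df.applymap(translate_cell)` (or `df.map(translate_cell)` in recent pandas versions) should apply this to every cell, though I have not tested that on this table.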
Many thanks for all the suggestions!