python unicode tabula

Tabula-py reads column data as unicode


I am trying to extract a table from a PDF. tabula-py identifies the table correctly, but the table data is extracted as Unicode escape codes rather than readable string data.

from tabula import read_pdf
df = read_pdf('https://watermark.silverchair.com/fsab153.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAs0wggLJBgkqhkiG9w0BBwagggK6MIICtgIBADCCAq8GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMOXfntjWl9L87SyaXAgEQgIICgMSxXbyEzl4Y3sDeaGncgcE9V93d46LWUAnMiKz0KtHAKJA1HpPuefZZzrhJlD_hNUzK9C4uWwF1EfAbe0aWG3c_sFLetD5kqOWXzuGARvCRWOvmAEKpgtx0Desj5MY9lH7Zp7XxbfLBLScOIK6X_qEZ3Low6GkQfm1iBCbVHUg9ueKxLaYghX--uHPqmx43RZHk8bAjoDdMDT9lPsVXqlZJkmS2UT6T3uzC1jPTz3eON93C5CaEpW4lG_zvzMMltlZZm04Zz1vWd7WsXa_Gvc1gwO1AwUNcBxrRrr7Af5U02SPMaFF8dL0cOqrpw24LPzrg8ibtBq9yKidnCM-B2z74goz41kzv2KNZoPYQLj5XYlbyTknoE-MDo6cq_tGMw7igxbsrKUbGzSGILZ-bDQAVTyGKlU1QudNbZd4lDOe36kdr6dlhWHe7aK6vQgczTOYvQ0v1G5HwouxwTO0WPVpxawld76AZLhathmV4fMmNAYFpZDOytT4YAZEj-jjkPvzJ7HeA_-7ifmtwqLiOSILbLuJgEhLQ5frm9YXSn3crSInflJEsMm6Bs8pE_5H8vdex2tXzL6ZmHiDkDMdB_YM8iOhJGdMfZWsCJ0TcrtZyWZv5t-M1NzhLutplX-mYInE1sXZSTLHcOD0YDhEeMPNJhdGvISG_IbwDfH9OKuGQ0x8UCoe2DPVKOd53PYghKf2Bk8q7tILs3WeHgItnvRbkevjYS287gh_5052TKJJbC8dYxkVlHn-JCsbaMfn_SlYSaWjOfVxvSHKsVlFj5ry-cfScH8ai1bra8LASgwg4y_vpNeeDiA0CwZaPy2l_TF1O_yFsaKItyDkCMJXqhjI', pages=3)[0]
df['Unnamed: 0']

screenshot

What is the correct way to extract the data in UTF-8 or ASCII?

Edit: something on my system (Debian) is able to interpret these codes (see below), so the question is: how do I get this information out?

screenshot with whole DataFrame


Solution

  • After trying various suggestions from the comments, I ended up creating a dictionary that maps each extracted Unicode code point (all in the private-use range, e.g. '\uf639') to the digit it stands for. I wrote the extracted table to a CSV file and applied the map to get readable data.

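    import pandas as pd

    # Map each private-use code point found in the extracted table to its digit.
    utf_map = {'\uf639':'0', '\uf6dc':'1', '\uf63a':'2', '\uf63b':'3', '\uf63c':'4',
               '\uf63d':'5', '\uf63e':'6', '\uf63f':'7', '\uf640':'8', '\uf641':'9'}

    # Read the CSV that was written from the extracted table.
    with open('cod_catch.csv', encoding='utf-8') as f:
        string = f.read()

    # Drop spaces, translate mapped characters, and keep everything else as-is.
    new_string = ''
    for ch in string:
        if ch == ' ':
            continue
        new_string += utf_map.get(ch, ch)

    with open('cod_catch_translated.csv', 'w', encoding='utf-8') as f:
        f.write(new_string)

    cod_catch = pd.read_csv('cod_catch_translated.csv')

    print(cod_catch)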
    


    Many thanks for all the suggestions!
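
    A possible variation on the mapping step, sketched here with the standard library only: the same dictionary can be turned into a translation table with `str.maketrans`, so each value is converted in a single `str.translate` call and the intermediate CSV file becomes optional. The names `TRANSLATION_TABLE` and `translate_cell` are made up for this illustration, and the sample string is invented rather than taken from the PDF. Like the loop above, the sketch drops spaces.

```python
# Same code-point-to-digit mapping as in the solution above.
UTF_MAP = {'\uf639': '0', '\uf6dc': '1', '\uf63a': '2', '\uf63b': '3', '\uf63c': '4',
           '\uf63d': '5', '\uf63e': '6', '\uf63f': '7', '\uf640': '8', '\uf641': '9'}

# Build the table once; mapping ' ' to None deletes spaces, as the loop above does.
TRANSLATION_TABLE = str.maketrans({**UTF_MAP, ' ': None})


def translate_cell(value):
    """Translate one cell; non-string values (e.g. NaN) pass through unchanged."""
    if isinstance(value, str):
        return value.translate(TRANSLATION_TABLE)
    return value


# Invented sample value, not real data from the table.
sample = '\uf6dc\uf639 \uf63a.\uf63d'
print(translate_cell(sample))
```

    Assuming the table is already in a DataFrame `df`, the function could be applied to every cell with `df.applymap(translate_cell)` (or `df.map(translate_cell)` on pandas 2.1 and later). The mapping itself is specific to this PDF's font, so other documents would need their own dictionary.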