pythoncsvamazon-textract

how to remove multiple headers


I have a spreadsheet that is in a pdf where I extract these values and transform them into .csv with textract from aws using Python. However, when I extract the values, there are several headers and I would like to keep only the first header.

account ;description ;old balance ;debit ;credit ;mov. ;balance ; **# --> first header**
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;

account ;description ;old balance ;debit ;credit ;mov. ;balance ; **# --> second header**
2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 ;

;;;;debit ;credit ;mov. ;balance ; **# --> third header**
3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75 ;

Note that in the same .csv file I have 3 headers and one of them only has a few values, but as I want to remove it, I believe it doesn't matter so much. So how to remove the other headers? using python. Each pdf has a different header, so I believe I can use the same solution in the others

OBS: this is the way I transform the string into csv

 # replace content
         with open("file_name.csv", "at") as fout:
            fout.write(table_csv)

I haven't tried any solutions as I can't think of anything useful


Solution

  • You can use re module to remove the duplicate headers. For example:

    text = """\
    account ;description ;old balance ;debit ;credit ;mov. ;balance ;
    1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
    1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
    
    account ;description ;old balance ;debit ;credit ;mov. ;balance ;
    2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 ;
    
    ;;;;debit ;credit ;mov. ;balance ;
    3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75 ;"""
    
    import re
    import pandas as pd
    from io import StringIO
    
    # remove the headers
    text = re.sub(r"(?m)\n\n^.*$", "", text.strip())
    
    # remove ; at end of lines
    text = re.sub(r"(?m);\s*$", "", text.strip())
    
    print(text)
    

    Prints:

    account ;description ;old balance ;debit ;credit ;mov. ;balance 
    1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 
    1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 
    2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 
    3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75 
    

    Then you can load the text to a DataFrame:

    df = pd.read_csv(StringIO(text), delimiter=";")
    print(df)
    

    Prints:

       account     description   old balance           debit          credit          mov.          balance 
    0       1.0  : investments    212.844.26   63.856.811,44   63.857.250.69       -439.25       212.405.01 
    1       1.0  : investments    212.844.26   63.856.811,44   63.857.250.69       -439.25       212.405.01 
    2       2.0  : investments    120.0400.0    20.000.000.0    82.840.400.0        -100.2        314.262.0 
    3       3.0     real state   1.000.200.4    4.000.031.47    2.273.121,44   -144.089.77   254.844.390,75 
    

    To save to CSV:

    df.to_csv('file_name.csv', index=False)