python pandas dataframe pdf python-camelot

Camelot PDF failing to strip text


I have this PDF and I'm trying to extract its very first table.

The issue happens when the name of the employer (EMPREGADOR) spans two lines.


I'm using the following command to try to strip the data correctly:

import camelot

# strip_text removes the listed characters from each cell's text
tables = camelot.read_pdf('tipo1/t1_3.pdf', pages='1', flavor='stream', edge_tol=500, strip_text='\n')
df = tables[0].df
print(df)

But the result is the following:

                      0                             1                           2
0            EMPREGADOR              DATA DE ADMISSÃO                   PIS/PASEP
1           ABC ABC ABC                                                          
2                                          07/01/2008                   123123123
3                  LTDA                                                          
4  CARTEIRA DE TRABALHO       INSCRIÇÃO DO EMPREGADOR             NÚMERO DA CONTA
5                123123                        123123                  1231231231
6         DATA DE OPÇÃO  DATA E CÓDIGO DE AFASTAMENTO                   CATEGORIA
7            07/01/2008               30/09/2011 - N2                           1
8         TIPO DE CONTA                 TAXA DE JUROS  VALOR PARA FINS RECISÓRIOS
9               OPTANTE                      3.0% a.a                     R$ 0,00

I read the docs and didn't find anything that would help me get the employer's (EMPREGADOR) data correctly (in this case, ABC ABC ABC LTDA).

This is an issue because the employer's name may span 1, 2, 3 or even more lines, which scatters the values across several rows of the DataFrame and makes it hard to parse.

Any suggestions?


Solution

  • As Stefano Fiorucci pointed out in the comments, Camelot does not currently support the feature needed here (merging text that wraps across several lines into a single cell). The solution was to clean up the DataFrame manually.
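
One possible way to do that manual cleanup with pandas is sketched below. It is only an illustration, not the exact code used: it assumes the label rows can be recognised by the text in their first column (the labels are copied from the output in the question), and it joins every non-empty cell found under a label until the next label row. The names `HEADER_LABELS` and `collapse_wrapped_rows` are hypothetical, and the sample DataFrame just reproduces the printed output so the sketch runs without the PDF.

```python
import pandas as pd

# Assumption: first-column labels that mark a row of field names,
# copied from the output shown in the question.
HEADER_LABELS = {
    "EMPREGADOR",
    "CARTEIRA DE TRABALHO",
    "DATA DE OPÇÃO",
    "TIPO DE CONTA",
}


def collapse_wrapped_rows(df, header_labels=HEADER_LABELS):
    """Return {field name: value}, joining values that wrap over several rows."""
    fields = {}
    current = None  # field names of the block currently being read
    for _, row in df.iterrows():
        cells = [str(cell).strip() for cell in row]
        if cells[0] in header_labels:
            # A label row starts a new block of fields.
            current = cells
            for name in current:
                fields[name] = []
        elif current is not None:
            # A value row: attach each non-empty cell to the field above it.
            for name, cell in zip(current, cells):
                if cell:
                    fields[name].append(cell)
    return {name: " ".join(parts) for name, parts in fields.items()}


# Stand-in for tables[0].df, rebuilt from the output in the question.
df = pd.DataFrame([
    ["EMPREGADOR", "DATA DE ADMISSÃO", "PIS/PASEP"],
    ["ABC ABC ABC", "", ""],
    ["", "07/01/2008", "123123123"],
    ["LTDA", "", ""],
    ["CARTEIRA DE TRABALHO", "INSCRIÇÃO DO EMPREGADOR", "NÚMERO DA CONTA"],
    ["123123", "123123", "1231231231"],
    ["DATA DE OPÇÃO", "DATA E CÓDIGO DE AFASTAMENTO", "CATEGORIA"],
    ["07/01/2008", "30/09/2011 - N2", "1"],
    ["TIPO DE CONTA", "TAXA DE JUROS", "VALOR PARA FINS RECISÓRIOS"],
    ["OPTANTE", "3.0% a.a", "R$ 0,00"],
])

fields = collapse_wrapped_rows(df)
print(fields["EMPREGADOR"])  # ABC ABC ABC LTDA
```

Because the values are collected per block rather than per row, the employer's name can take any number of lines. The sketch does rely on the label text being stable across documents, so it would need adjusting if the labels change.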