python-camelot

Camelot in Python does not behave as expected


I have two PDF documents with the same layout but different data. I can read one perfectly, but the data extracted from the other is unrecognizable.

This is an example that I can read perfectly:

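import camelot

# Parse page 1 (Camelot's default) with the stream flavor
from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df

# Plot the text Camelot detected and print the parsing report
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)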


The dataframe comes out as expected.

This is an example where the extracted information is unrecognizable:

# Same call as above, only the file name changes
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df

camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)


This time the dataframe contains unrecognizable information.

I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.


Solution

  • The problem: malformed PDF


    Put simply, your second PDF is malformed: it does not contain correct font information, so the text cannot be extracted from it as is. This is a known and difficult problem (see this question).

    You can check this by opening the PDF with Google Docs. Google Docs tries to extract the text, and the result is unreadable in the same way.
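The same sanity check can be roughed out in code. The sketch below is only a heuristic, not part of Camelot: `garbled_ratio` and `first_table_garbled_ratio` are hypothetical helper names, and it assumes the failure shows up either as `(cid:N)` markers (which pdfminer, the text extractor Camelot relies on, emits for glyphs without a Unicode mapping) or as control and other unexpected characters.

```python
import re
import string

# pdfminer emits "(cid:N)" when a glyph has no Unicode mapping in the font.
CID_MARKER = re.compile(r"\(cid:\d+\)")
ALLOWED_PUNCTUATION = set(string.punctuation)


def garbled_ratio(text):
    """Fraction of non-whitespace symbols that look like failed glyph decoding."""
    remaining, cid_count = CID_MARKER.subn("", text)
    chars = [c for c in remaining if not c.isspace()]
    # Anything that is neither a letter/digit nor ordinary punctuation is suspect.
    odd = sum(1 for c in chars if not (c.isalnum() or c in ALLOWED_PUNCTUATION))
    total = len(chars) + cid_count
    return (odd + cid_count) / total if total else 0.0


def first_table_garbled_ratio(pdf_path):
    """Hypothetical helper: score the first table Camelot finds on page 1."""
    import camelot  # imported lazily so garbled_ratio works without Camelot

    tables = camelot.read_pdf(pdf_path, flavor="stream")
    cells = tables[0].df.to_numpy().ravel()
    return garbled_ratio(" ".join(str(cell) for cell in cells))
```

Comparing `first_table_garbled_ratio('2019_05_2.pdf')` with `first_table_garbled_ratio('2020_04_2.pdf')` should make the difference visible if the second file decodes to markers or odd symbols. It will not catch a font that maps glyphs to the wrong but otherwise ordinary letters, so a low score does not prove the PDF is healthy.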

  • Possible solutions


    If you only need the text, you can print the document to an image-based PDF and run OCR on it. However, Camelot does not currently support image-based PDFs, so this route will not recover the table.
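As a rough illustration of the OCR route, here is a minimal sketch. It assumes the third-party packages `pdf2image` (which needs Poppler installed) and `pytesseract` (which needs the Tesseract binary), neither of which is mentioned in the question; `ocr_pdf` is a hypothetical helper name. Rasterizing the original pages directly has a similar effect to printing to an image-based PDF first. The output is plain text per page, not a table.

```python
def ocr_pdf(pdf_path, dpi=300, lang="eng", rasterize=None, recognize=None):
    """Return one OCR'd text string per page of the PDF.

    rasterize and recognize are optional hooks (an assumption of this sketch)
    so the pipeline can be swapped out or exercised without the OCR stack.
    """
    if rasterize is None:
        # pdf2image renders each page to a PIL image via Poppler.
        from pdf2image import convert_from_path

        def rasterize(path):
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        # pytesseract wraps the Tesseract OCR engine.
        import pytesseract

        def recognize(image):
            return pytesseract.image_to_string(image, lang=lang)

    return [recognize(page) for page in rasterize(pdf_path)]
```

Calling `ocr_pdf('2020_04_2.pdf')` would give the page text if both tools are installed. The `lang` value should match the language of the document, and rebuilding the table from that text is a separate step.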

    If you have no way to recover a well-formed PDF, you could try this strategy: