Tags: python, encoding, tabula

Python tabula: encoding error when reading a PDF


I would like to export the tables from a PDF into a DataFrame or a CSV file, but I cannot read the PDF with Python. What do I need to do? I tried reading the PDF with tabula:

from tabula import read_pdf

df = read_pdf(name)

and I get:

> pages' argument isn't specified.Will extract only from page 1 by default.
Got stderr: Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+564 (564) in font Calibri,Bold-Identity-H
Dec 28, 2021 1:14:07 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+639 (639) in font Calibri,Bold-Identity-H
…(the same PDFBox warning repeats for dozens more CIDs)…
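As an aside, the first line of that output is only a notice: read_pdf reads page 1 unless told otherwise. A minimal sketch of passing pages explicitly and suppressing the stderr noise via tabula-py's silent option (assumptions: tabula-py and a Java runtime are installed, and your_file.pdf is a hypothetical filename; the call is guarded so the snippet degrades gracefully where they are absent):

```python
try:
    from tabula import read_pdf

    # pages="all" extracts every page; silent=True suppresses the PDFBox warnings
    dfs = read_pdf("your_file.pdf", pages="all", silent=True)
except Exception:
    # tabula-py missing, Java missing, or the file is unreadable
    dfs = None
```

When it succeeds, dfs is a list of DataFrames, one per detected table.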

Solution

  • In the comments I suggested the contents of the PDF were at fault, since the Greek words had not been encoded correctly; that many warnings points to a poor-quality PDF.

    So before investing any more time tuning extraction for a suspect source, first verify that cutting and pasting the table even yields something worth capturing. My initial assessment is that you might get just the second column of numbers; there are some hidden Greek words that do not show, but the rest is garbage, so the only valid extraction route is OCR.

    Thus the best approach would be to correct that PDF first using OCR; however, many OCR attempts are themselves misled by the PDF's existing broken text layer.

    So in that case, the best working solution is to OCR afresh from page images. As an example I printed the first page badly, but as a proof of concept it shows that the image route may get you closer to your goal.

    I currently only have a means to export as monochrome TIFF via 200 dpi fax; you would get much better results using grey-scale .png, .pbm or .tif(f) (NOT .jpg).
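    As a sketch of that image-to-text step, the following builds the Tesseract CLI call for one page image (assumptions: the tesseract binary and its Greek model "ell" are installed, and page1.png is a hypothetical filename; the command only runs if tesseract is actually on PATH):

```python
import shutil
import subprocess

def tesseract_cmd(image_path, out_base, langs="ell+eng"):
    # "ell" is Tesseract's Greek language model; output lands in out_base + ".txt"
    return ["tesseract", image_path, out_base, "-l", langs]

cmd = tesseract_cmd("page1.png", "page1")
if shutil.which("tesseract"):
    subprocess.run(cmd)  # writes page1.txt when page1.png exists
```

    A grey-scale .png at 300 dpi as input would give Tesseract far more to work with than a monochrome fax-quality scan.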

    [image: the first page exported as a monochrome 200 dpi TIFF]

    Once converted to plain text, .docx, .xlsx, etc., it should produce something like this; ignore the poor headings in this sample, a byproduct of such a crude monochrome attempt with a dotty background.

    [image: OCR output of the sample page, opened as text]

    Clearly the result needs some clean-up to match the input, such as spell checking, but it should then be good enough for any further textual processing. A better choice of image resolution and a target output such as CSV would have produced a more usable result, and thus a closer answer to your question.
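    Once extraction (by tabula or OCR) yields DataFrames, writing the CSV target is straightforward. A sketch using hypothetical stand-in tables in place of real extraction output; utf-8-sig is chosen so the Greek text opens correctly in Excel:

```python
import pandas as pd

def frames_to_csv(frames, path):
    # one CSV from the list of per-table DataFrames that read_pdf returns
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(path, index=False, encoding="utf-8-sig")  # BOM so Excel detects UTF-8
    return combined

# demo with stand-in tables (real extraction output would go here)
demo = frames_to_csv(
    [pd.DataFrame({"item": ["αλφα"], "value": [1]}),
     pd.DataFrame({"item": ["βητα"], "value": [2]})],
    "tables.csv",
)
```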