
Convert PDF to text: Adobe Reader vs. Python libraries


I have a PDF that I am trying to convert to text for further processing.

The structure of the PDF is stable but tricky: it contains graphical elements and graphs that sometimes serve as a background for text written at a particular position. Therefore, I'd like to extract as much text as possible.

I first tried Adobe Reader's function for saving the PDF as text, which gives good results but doesn't allow the process to be fully automated. At least, I don't know of a way to drive Adobe Reader from the command line.

Therefore, I tried some Python libraries designed for this purpose, but they seem to convert the PDF to text in a different way. I tried PDFMiner, PyPDF2 and pdftotext. None of them gives me the same result as Adobe Reader.

The PDF looks like the following (slightly cropped to hide sensitive data that isn't relevant here):

[image: cropped screenshot of the PDF page]

Adobe extracts the following text:

OCT 15° (4.3 mm) ART (25) Q: 34 [HR]

ILMILM200μm200μm 04590135180225270315360

TMPTSNSNASNITITMP

1000 800 600 400 200 0

Position [°]

CC 7.7 (APS)

G227(12%) T206(54%) TS226(20%) TI304(38%) N203(5%) NS213(6%) NI276(12%) Segmentationunconfirmed! Classification MRW Within Normal Limits

OCT ART (100) Q: 31 [HS]

ILMILMRNFLRNFL200μm200μm 111 04590135180225270315360

300 240 180 120 60 0

TMP TS NS NAS NI TI TMP

Position [°]

CC 7.7 (APS)

Classification RNFLT Outside Normal Limits

G78<1% T62(15%) TS103(5%) TI134(10%) N65(7%) NS77(3%) NI73(3%) Segmentationunconfirmed! RNFL Thickness (3.5 mm) [μm]

WithinNormalLimits(>5%) Borderline(<5%)OutsideNormalLimits(<1%)

PDFMiner, for example, extracts:

Average Thickness [�m]

Vol [mm�] 8.26

200 �m 200 �m

OCT 20.0� (5.6 mm) ART (21) Q: 25 [HS]

267 1.42

321 0.50

335 0.53

299 1.59

Center:

Central Min:

Central Max:

222 �m

221 �m

314 �m

Circle Diameters: 1, 3, 6 mm ETDRS

292 1.55

331 0.52

272 0.21

326 0.51

271 1.44

ILMILM

BMBM

200 �m 200 �m

This is very different. Is there a reason for that, and do you know of any Python library that can convert a PDF to text as well as Adobe Reader does?


Solution

  • This doesn't explain why Adobe Reader extracts the text from a PDF differently than some Python libraries do, but I got really good results with tika.

    This is what tika extracted:

    OCT 15� (4.2 mm) ART (26) Q: 31 [HR]

    NITSTMP NAS TMPTINSM in

    im u

    m R

    im W

    id th

    [ �

    m ]

    1000 800 600 400 200

    0

    Position [�]

    36031527022518013590450

    ILMILM

    RNFLRNFL

    200 �m200 �m

    OCT ART (100) Q: 27 [HS]

    NITSTMP NAS TMPTINS

    R N

    F L T

    h ickn

    e ss (3

    .5 m

    m ) [�

    m ]

    300 240 180 120 60 0

    Position [�]

    36031527022518013590450

    40

    G 240

    (10%)

    T 239

    (70%)

    TS 213 (9%)

    TI 285

    (22%)

    N 230 (5%)

    NS 209 (3%)

    NI 283 (9%)

    CC 7.7 (APS)

    Segmentation unconfirmed!

    Classification MRW

    Borderline

    G 78

    <1%

    T 58

    (8%)

    TS 91

    (2%)

    TI 124 (6%)

    N 64

    (8%)

    NS 110

    (43%)

    NI 71

    (4%)

    CC 7.7 (APS)

    Segmentation unconfirmed!

    Classification RNFLT

    Outside Normal Limits

    Within Normal Limits (>5%)

    Borderline (<5%) Outside Normal Limits (<1%)

    Reference database: European Descent (2014)
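
The answer does not show the call that was used, so the following is only a minimal sketch of how such an extraction might look with the tika Python package (`pip install tika`; it needs a Java runtime and starts a local Tika server on first use). The function names `extract_text_with_tika` and `parse_sectors` are hypothetical helpers, and the regular expression is an assumption based on the sector labels (G, T, TS, TI, N, NS, NI) visible in the output above, where a value and its percentage are often split across lines.

```python
import re

# Assumption: sector labels as they appear in the extracted text above.
SECTOR_RE = re.compile(
    r"\b(G|TS|TI|T|NS|NI|N)\s+(\d+)\s+(\(\d+%\)|<\d+%)"
)


def extract_text_with_tika(pdf_path):
    """Return the plain text Tika extracts from pdf_path (empty string if none)."""
    # Imported lazily so the parsing helper below works without tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path)  # dict with "metadata" and "content"
    return parsed.get("content") or ""


def parse_sectors(text):
    """Collect (label, value, percentile) triples such as ("G", 240, "10%").

    \\s+ also matches newlines, so values split over several lines are rejoined.
    """
    return [
        (m.group(1), int(m.group(2)), m.group(3).strip("()"))
        for m in SECTOR_RE.finditer(text)
    ]
```

Usage would be along the lines of `parse_sectors(extract_text_with_tika("report.pdf"))`, where `report.pdf` is a placeholder path. Whether the pattern covers every layout of these reports is untested; it only handles the shapes shown in the output above.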