I have a PDF that I am trying to convert to text for further processing.
The structure of the PDF is stable but tricky: it contains graphical elements and graphs that sometimes serve as a background for text written at a particular position. I'd therefore like to extract as much of the text as possible.
I first tried Adobe Reader's function to save the PDF as text, which gives good results but does not let me fully automate the process. At least, I don't know of a way to drive Adobe Reader from the command line.
Therefore, I tried some Python libraries designed for this purpose, but they seem to convert the PDF to text differently. I tried PDFMiner, PyPDF2 and pdftotext. None of them gives me the same result as Adobe Reader.
The PDF looks like the following (a little cropped due to sensitive data which isn't relevant):
Adobe extracts the following text:
OCT 15° (4.3 mm) ART (25) Q: 34 [HR]
ILMILM200μm200μm 04590135180225270315360
TMPTSNSNASNITITMP
1000 800 600 400 200 0
Position [°]
CC 7.7 (APS)
G227(12%) T206(54%) TS226(20%) TI304(38%) N203(5%) NS213(6%) NI276(12%) Segmentationunconfirmed! Classification MRW Within Normal Limits
OCT ART (100) Q: 31 [HS]
ILMILMRNFLRNFL200μm200μm 111 04590135180225270315360
300 240 180 120 60 0
TMP TS NS NAS NI TI TMP
Position [°]
CC 7.7 (APS)
Classification RNFLT Outside Normal Limits
G78<1% T62(15%) TS103(5%) TI134(10%) N65(7%) NS77(3%) NI73(3%) Segmentationunconfirmed! RNFL Thickness (3.5 mm) [μm]
WithinNormalLimits(>5%) Borderline(<5%)OutsideNormalLimits(<1%)
PDFMiner, for example, extracts:
Average Thickness [�m]
Vol [mm�] 8.26
200 �m 200 �m
OCT 20.0� (5.6 mm) ART (21) Q: 25 [HS]
267 1.42
321 0.50
335 0.53
299 1.59
Center:
Central Min:
Central Max:
222 �m
221 �m
314 �m
Circle Diameters: 1, 3, 6 mm ETDRS
292 1.55
331 0.52
272 0.21
326 0.51
271 1.44
ILMILM
BMBM
200 �m 200 �m
This is very different. Is there a reason for that, and do you know of any Python library that can convert a PDF to text the way Adobe Reader does?
This does not necessarily explain why Adobe Reader extracts the text from a PDF differently from some Python libraries, but I have achieved a really good result with Tika.
This is what Tika extracted:
OCT 15� (4.2 mm) ART (26) Q: 31 [HR]
NITSTMP NAS TMPTINSM in
im u
m R
im W
id th
[ �
m ]
1000 800 600 400 200
0
Position [�]
36031527022518013590450
ILMILM
RNFLRNFL
200 �m200 �m
OCT ART (100) Q: 27 [HS]
NITSTMP NAS TMPTINS
R N
F L T
h ickn
e ss (3
.5 m
m ) [�
m ]
300 240 180 120 60 0
Position [�]
36031527022518013590450
40
G 240
(10%)
T 239
(70%)
TS 213 (9%)
TI 285
(22%)
N 230 (5%)
NS 209 (3%)
NI 283 (9%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification MRW
Borderline
G 78
<1%
T 58
(8%)
TS 91
(2%)
TI 124 (6%)
N 64
(8%)
NS 110
(43%)
NI 71
(4%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification RNFLT
Outside Normal Limits
Within Normal Limits (>5%)
Borderline (<5%) Outside Normal Limits (<1%)
Reference database: European Descent (2014)
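For reference, here is a minimal sketch of how such an extraction could be scripted. It assumes the `tika` package from PyPI (tika-python), which needs a Java runtime and starts or contacts a Tika server on first use. The function names `extract_text` and `drop_blank_lines` are made up for illustration and are not part of the library. The blank-line cleanup is an optional extra step, not something Tika does itself.

```python
def extract_text(pdf_path):
    """Return the text Tika extracts from pdf_path, or "" if there is none."""
    # Imported inside the function so that drop_blank_lines() below can be
    # used even where the tika package is not installed.
    # Assumption: tika-python is installed (pip install tika) and Java is available.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # "content" can be None when no text layer is found.
    return parsed.get("content") or ""


def drop_blank_lines(text):
    """Strip surrounding whitespace from each line and discard empty lines."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)
```

Calling `drop_blank_lines(extract_text("report.pdf"))`, where `report.pdf` is a placeholder file name, would then give the extracted text without the empty lines, ready for further processing.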