I use PDFNet (version 9.308007) to convert PDF files to text. I recently had to upgrade from Ubuntu 16.04 to Ubuntu 20.04. The problem is that the word order in the output files changed when converting with PDFNet on Ubuntu 20.04. For example:
Ubuntu 16.04
'\r\n -$14,309.29\r\n Payment - 12/19/2022 - Thank You;
Ubuntu 20.04
'Payment - 12/19/2022 - Thank You -$14,309.29\r\n'
I need the word order to be exactly as in the first variant (Ubuntu 16.04). I would be very grateful for any hints on where to dig further.
Assuming not all fonts in the PDF are embedded, the issue is that the two systems have different fonts installed. When PDFNet substitutes a font for a non-embedded one, the substitute chosen on each system has different metrics and glyphs. These subtle differences can affect text run detection and result in different text extraction output.
Install the same fonts on the Ubuntu 20.04 system as on the Ubuntu 16.04 system. This should result in the same font substitutions and therefore the same text extraction order.
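To find out which fonts differ, one option is to compare the font family lists of the two systems. The sketch below is not part of PDFNet. It assumes that fontconfig's `fc-list` is available on both machines and that its listing is representative of the fonts PDFNet can see, which may not hold if PDFNet scans font directories on its own. The function names are hypothetical helpers.

```python
import subprocess
from typing import List, Set


def parse_families(listing: str) -> Set[str]:
    """Parse the text output of `fc-list : family` into a set of family names."""
    families = set()
    for line in listing.splitlines():
        # One line may carry several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return families


def missing_families(old_listing: str, new_listing: str) -> List[str]:
    """Return families present in the old listing but absent from the new one."""
    return sorted(parse_families(old_listing) - parse_families(new_listing))


def local_listing() -> str:
    """Run `fc-list : family` on the current machine (assumes fontconfig is installed)."""
    result = subprocess.run(
        ["fc-list", ":", "family"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A possible workflow: run `fc-list : family > fonts_1604.txt` on the Ubuntu 16.04 machine, copy the file to the Ubuntu 20.04 machine, and call `missing_families(open("fonts_1604.txt").read(), local_listing())` there. The families it returns are candidates to install on the new system. The file name is only an example.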