1.
How can I get Polish characters from a PDF file? Can I somehow tell PdfVariant::GetString() to process Polish characters? I get \200 instead of ł, for example, and the funny thing is that this only happens when ł occurs as the first "non-base" character. So if the PDF file begins with aaaałęąaaaa, the ł is coded as \200, the ę as \201 and the ą as \202, but if the PDF file begins with aaaaąęłaaaa, the ł is coded as \202, the ę as \201 and the ą as \200.
How can I get these characters correctly on any system?
2.
When I try to extract text from a PDF file, I do something like this:
    string input_name = "example.pdf";
    PdfMemDocument pdf(input_name.c_str());
    for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
        PdfPage* page = pdf.GetPage(pn);
        PdfContentsTokenizer tok(page);
        const char* token = nullptr;
        PdfVariant var;
        EPdfContentsType type;
        while (tok.ReadNext(type, token, var)) {
            //etc.
        }
    }
But I have a problem with PdfContentsTokenizer tok(page); it doesn't work properly. For some PDF files it runs smoothly, and for others it throws an "Access violation reading location" error in the inffas32.asm file, at line 669:
L_get_length_code_mmx:
pand mm4,mm0
movd eax,mm4
movq mm4,mm3
mov eax, [ebx+eax*4] ; this is the error line
Btw, I noticed that not every PDF file is encoded in the same way. For example, using podofobrowser I couldn't see the Hello World! text from the official podofo helloworld example. For other PDF files, podofobrowser showed the text in different ways or didn't show it at all.
Ad 1. The link to the patch files that allow extracting Polish text from a PDF using TextExtractor.
This is the most important line when it comes to extracting non-Unicode text from a PDF:
PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode( rString, pCurFont );
Ad 2. The problem was the zlib library, which had been built incorrectly. I rebuilt it, then rebuilt podofo, and the problem is gone.