[SOLVED] PoDoFo Polish characters & PdfContentsTokenizer error

How can I get Polish characters from a PDF file? Can I somehow tell

PdfVariant::GetString()

to process Polish characters? I get \200 instead of ł, for example, and the funny thing is that the code depends on which "non-base" character occurs first. If the PDF file begins with aaaałęąaaaa, ł is encoded as \200, ę as \201 and ą as \202, but if it begins with aaaaąęłaaaa, ł is encoded as \202, ę as \201 and ą as \200. How can I get these characters correctly on any system?

When I try to extract text from a PDF file, I do something like this:

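// Assumes #include <podofo/podofo.h>, using namespace PoDoFo; and using namespace std;
string input_name = "example.pdf";
PdfMemDocument pdf(input_name.c_str());
for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PdfPage* page = pdf.GetPage(pn);
    PdfContentsTokenizer tok(page);   // tokenizes the page's content stream
    const char* token = nullptr;      // set when an operator keyword is read
    PdfVariant var;                   // set when an operand is read
    EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        // etc.
    }
}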

But I have a problem with PdfContentsTokenizer tok(page); it doesn't work properly. For some PDF files it runs smoothly, but for others it throws an "Access violation reading location" error in the file inffas32.asm, at line 669:

L_get_length_code_mmx:
pand mm4,mm0
movd eax,mm4
movq mm4,mm3
mov  eax, [ebx+eax*4] ; this is the error line

By the way, I noticed that not every PDF file is encoded in the same way. For example, using podofobrowser I couldn't see the Hello World! text from the official PoDoFo helloworld example. For other PDF files, podofobrowser showed the text in different ways or didn't show it at all.

Solution

Ad 1. The link to the patch files which allow extracting Polish text from a PDF using TextExtractor.

This is the most important line when it comes to extracting non-Unicode text from a PDF:

PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode( rString, pCurFont );
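
To illustrate why the same byte can stand for different letters in different files, here is a minimal, self-contained sketch of the idea behind that call. It is not PoDoFo's implementation: FontMap, toUtf8 and decodeWithFontMap are hypothetical names. It assumes that each font in the PDF carries its own code-to-Unicode table, so the raw bytes only make sense together with the font that was active when they were drawn.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for one font's code-to-Unicode table.
using FontMap = std::map<unsigned char, char32_t>;

// Encode a single Unicode code point as UTF-8.
std::string toUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Translate the raw bytes of a PDF string through the given font table.
// Bytes without an entry are passed through unchanged (assumed to be ASCII).
std::string decodeWithFontMap(const std::string& raw, const FontMap& font) {
    std::string out;
    for (unsigned char byte : raw) {
        auto it = font.find(byte);
        out += toUtf8(it != font.end() ? it->second
                                       : static_cast<char32_t>(byte));
    }
    return out;
}
```

With a table such as {0x80: U+0142 (ł), 0x81: U+0119 (ę), 0x82: U+0105 (ą)} the bytes aaaa\200\201\202aaaa decode to aaaałęąaaaa. With the table of the second file, where 0x80 is ą and 0x82 is ł, the very same bytes decode to aaaaąęłaaaa. That is why reading the raw string alone cannot give a stable result.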

Ad 2. The problem was the zlib library (inffas32.asm is part of zlib), which had been built incorrectly. I rebuilt it, then rebuilt PoDoFo, and the problem is gone.