c++ visual-studio-2015 polish podofo

PoDoFo Polish characters & PdfContentsTokenizer error


1.

How do I get Polish characters from a PDF file? Can I somehow tell

PdfVariant::GetString()

to process Polish characters? Because I get \200 instead of ł, for example, and the funny thing is that this depends on the order in which the "non-base" characters first occur. If the PDF file begins with aaaałęąaaaa, ł is coded as \200, ę as \201 and ą as \202, but if the PDF file begins with aaaaąęłaaaa, ł is coded as \202, ę as \201 and ą as \200. How can I get these characters on any system?

2.

When I try to extract text from a PDF file, I do something like this:

string input_name = "example.pdf";
PdfMemDocument pdf(input_name.c_str());
for (int pn = 0; pn < pdf.GetPageCount(); ++pn) {
    PdfPage* page = pdf.GetPage(pn);
    PdfContentsTokenizer tok(page);
    const char* token = nullptr;
    PdfVariant var;
    EPdfContentsType type;
    while (tok.ReadNext(type, token, var)) {
        // etc.

But I have a problem with PdfContentsTokenizer tok(page); it doesn't work properly. For some PDF files it goes smoothly, and for others it throws an "Access violation reading location" error in the file inffas32.asm, line 669:

L_get_length_code_mmx:
pand mm4,mm0
movd eax,mm4
movq mm4,mm3
mov  eax, [ebx+eax*4] ; this is the error line

By the way, I noticed that not every PDF file is encoded in the same way. For example, using podofobrowser I couldn't see the Hello World! text from the official PoDoFo helloworld example. For other PDF files, podofobrowser showed the text in different ways or didn't show it at all.


Solution

  • Ad 1. The link to the patch files that allow extracting Polish text from a PDF using TextExtractor.

    This is the most important line when it comes to extracting non-Unicode text from a PDF:

    PdfString unicode = pCurFont->GetEncoding()->ConvertToUnicode( rString, pCurFont );
    

    Ad 2. The problem was the zlib library, which had been built incorrectly. I rebuilt it, rebuilt PoDoFo, and the problem is gone.
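
To illustrate why Ad 1 needs the font's encoding at all, here is a minimal, library-free sketch. It assumes (this is my guess at the cause, not something confirmed by the PoDoFo sources) that the PDF producer embeds a subset font and hands out byte codes starting at \200 in order of first appearance, which would match the behaviour described in the question. CodeMap, buildSubsetMap and decode are hypothetical names made up for this example; they are not PoDoFo API. In real code the per-font table lives inside the font object and ConvertToUnicode does the lookup.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for one font's code -> Unicode table (not a PoDoFo type).
using CodeMap = std::map<unsigned char, char32_t>;

// Assumption: the producer assigns codes 0x80 (octal \200), 0x81, ... to
// non-ASCII characters in the order in which they first appear in the text.
CodeMap buildSubsetMap(const std::u32string& text) {
    CodeMap codeToUnicode;
    std::map<char32_t, unsigned char> alreadyAssigned;
    int next = 0x80;
    for (char32_t ch : text) {
        if (ch < 0x80 || alreadyAssigned.count(ch)) {
            continue;  // ASCII keeps its own code; repeats reuse their code
        }
        assert(next <= 0xFF && "a single-byte font has at most 256 codes");
        alreadyAssigned[ch] = static_cast<unsigned char>(next);
        codeToUnicode[static_cast<unsigned char>(next)] = ch;
        ++next;
    }
    return codeToUnicode;
}

// Decode the raw bytes of a string operand using the table of the font
// that was active when the string was shown.
std::u32string decode(const std::string& raw, const CodeMap& font) {
    std::u32string out;
    for (unsigned char byte : raw) {
        auto it = font.find(byte);
        // Bytes without an entry are passed through unchanged.
        out.push_back(it != font.end() ? it->second
                                       : static_cast<char32_t>(byte));
    }
    return out;
}
```

With a table built from aaaałęąaaaa, decode("\200\201\202", table) gives łęą; with a table built from aaaaąęłaaaa, the same three bytes give ąęł. So the raw bytes alone are ambiguous, and the conversion has to go through the current font, which is what the ConvertToUnicode line above does.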