c++multithreadingmfcxpdf

Make xpdf Pdf2Txt function as thread safe


I have tried to use xpdf source code into a MFC application to convert pdf to text. The code sample is taken from their site (or repository):

int Pdf2Txt(std::string PdfFile, std::string TxtFile) const
{
    GString* ownerPW, *userPW;
    UnicodeMap* uMap;
    TextOutputDev* textOut;
    TextOutputControl textOutControl;
    GString* textFileName;

    int exitCode;
    char textEncName[128] = "";
    char textEOL[16] = "";
    GBool noPageBreaks = gFalse;
    GBool quiet = gFalse;
    char ownerPassword[33] = "\001";
    char userPassword[33] = "\001";
    int firstPage = 1;
    int lastPage = 0;
    GBool tableLayout = gFalse;
    double fixedPitch = 0;
    GBool physLayout = gFalse;
    GBool simpleLayout = gFalse;
    GBool simple2Layout = gFalse;
    GBool linePrinter = gFalse;
    GBool rawOrder = gFalse;
    double fixedLineSpacing = 0;
    double marginLeft = 0;
    double marginRight = 0;
    double marginTop = 0;
    double marginBottom = 0;
    GBool clipText = gFalse;
    GBool discardDiag = gFalse;
    GBool insertBOM = gFalse;

    exitCode = 99;

    // read config file
    globalParams = new GlobalParams("");
    if (textEncName[0])
    {
        globalParams->setTextEncoding(textEncName);
    }
    if (textEOL[0])
    {
        if (!globalParams->setTextEOL(textEOL))
        {
            fprintf(stderr, "Bad '-eol' value on command line\n");
        }
    }
    if (noPageBreaks)
    {
        globalParams->setTextPageBreaks(gFalse);
    }
    if (quiet)
    {
        globalParams->setErrQuiet(quiet);
    }

    // Set UNICODE support
    globalParams->setTextEncoding("UTF-8");

    // get mapping to output encoding
    if (!(uMap = globalParams->getTextEncoding()))
    {
        error(errConfig, -1, "Couldn't get text encoding");
        goto err1;
    }

    // open PDF file
    if (ownerPassword[0] != '\001')
    {
        ownerPW = new GString(ownerPassword);
    }
    else
    {
        ownerPW = NULL;
    }
    if (userPassword[0] != '\001')
    {
        userPW = new GString(userPassword);
    }
    else
    {
        userPW = NULL;
    }
    PDFDoc* doc = new PDFDoc((char*)PdfFile.c_str(), ownerPW, userPW);
    if (userPW)
    {
        delete userPW;
    }
    if (ownerPW)
    {
        delete ownerPW;
    }
    if (! doc->isOk())
    {
        exitCode = 1;
        goto err2;
    }

    // check for copy permission
    if (! doc->okToCopy())
    {
        error(errNotAllowed, -1, "Copying of text from this document is not allowed.");
        exitCode = 3;
        goto err2;
    }

    // construct text file name
    textFileName = new GString(TxtFile.c_str());

    // get page range
    if (firstPage < 1)
    {
        firstPage = 1;
    }
    if (lastPage < 1 || lastPage > doc->getNumPages())
    {
        lastPage = doc->getNumPages();
    }

    // write text file
    if (tableLayout)
    {
        textOutControl.mode = textOutTableLayout;
        textOutControl.fixedPitch = fixedPitch;
    }
    else if (physLayout)
    {
        textOutControl.mode = textOutPhysLayout;
        textOutControl.fixedPitch = fixedPitch;
    }
    else if (simpleLayout)
    {
        textOutControl.mode = textOutSimpleLayout;
    }
    else if (simple2Layout)
    {
        textOutControl.mode = textOutSimple2Layout;
    }
    else if (linePrinter)
    {
        textOutControl.mode = textOutLinePrinter;
        textOutControl.fixedPitch = fixedPitch;
        textOutControl.fixedLineSpacing = fixedLineSpacing;
    }
    else if (rawOrder)
    {
        textOutControl.mode = textOutRawOrder;
    }
    else
    {
        textOutControl.mode = textOutReadingOrder;
    }
    textOutControl.clipText = clipText;
    textOutControl.discardDiagonalText = discardDiag;
    textOutControl.insertBOM = insertBOM;
    textOutControl.marginLeft = marginLeft;
    textOutControl.marginRight = marginRight;
    textOutControl.marginTop = marginTop;
    textOutControl.marginBottom = marginBottom;
    textOut = new TextOutputDev(textFileName->getCString(), &textOutControl, gFalse, gTrue);
    if (textOut->isOk())
    {
        doc->displayPages(textOut, firstPage, lastPage, 72, 72, 0, gFalse, gTrue, gFalse);
    }
    else
    {
        delete textOut;
        exitCode = 2;
        goto err3;
    }
    delete textOut;

    exitCode = 0;

    // clean up
err3:
    delete textFileName;
err2:
    delete doc;
//  uMap->decRefCnt();
err1:
    delete globalParams;

    // check for memory leaks
    Object::memCheck(stderr);
    gMemReport(stderr);

    return exitCode;
}

So far, so good. But this code isn't thread safe: if I am trying run this code inside a multi-threading code, it crashes:

// TextOutputDev.cc
if (uMap->isUnicode())
{

    lreLen = uMap->mapUnicode(0x202a, lre, sizeof(lre)); // <-- crash

Why ? Because there is a variable, globalParams, which is deleted in the last lines of the function, and it's common for all threads:

delete globalParams;

And globalParams it's an extern global variable from GlobalParams.h (part of xpdf code):

// xpdf/GlobalParams.h
// The global parameters object.
extern GlobalParams *globalParams;

How can I do this function thread safe ? Because the "problem variable" it's inside xpdf source code, not in mine ...

P.S. To sum up things, globalParams it's declared in xpdf code, and it's cleaned in my (client) code.

The xpdf source code could be seen here: https://github.com/jeroen/xpdf/blob/c2c946f517eb09cfd09d957e0f3b04d44bf6f827/src/poppler/GlobalParams.h

and

https://github.com/jeroen/xpdf/blob/c2c946f517eb09cfd09d957e0f3b04d44bf6f827/src/poppler/GlobalParams.cc


Solution

  • Try restructuring your code as shown below. I have moved the GlobalParams initialization code into a separate function. This function should be called (once) during initialization, or before starting the threads that call Pdf2Txt(). And of course the GlobalParams instance shouldn't be destroyed because it can be used by multiple threads. It won't hurt your app to keep it memory, it's one object anyway and not really large - well, it contains many int and bool member variables, but these do not take up much space, and quite a few string* variables (initially null or emtpy I guess), so it's just a few KB at most.

    bool InitGlobalParams()
    {
        UnicodeMap* uMap;
        char textEncName[128] = "";
        char textEOL[16] = "";
        GBool noPageBreaks = gFalse;
        GBool quiet = gFalse;
    
        // read config file
        globalParams = new GlobalParams(""); // <-- Maybe add some checking code here?
        if (textEncName[0])
        {
            globalParams->setTextEncoding(textEncName);
        }
        if (textEOL[0])
        {
            if (!globalParams->setTextEOL(textEOL))
            {
                fprintf(stderr, "Bad '-eol' value on command line\n");
            }
        }
        if (noPageBreaks)
        {
            globalParams->setTextPageBreaks(gFalse);
        }
        if (quiet)
        {
            globalParams->setErrQuiet(quiet);
        }
    
        // Set UNICODE support
        globalParams->setTextEncoding("UTF-8");
    
        // get mapping to output encoding
        if (!(uMap = globalParams->getTextEncoding()))
        {
            error(errConfig, -1, "Couldn't get text encoding");
            return false;
        }
        return true;
    }
    
    int Pdf2Txt(std::string PdfFile, std::string TxtFile) const
    {
        GString* ownerPW, *userPW;
        TextOutputDev* textOut;
        TextOutputControl textOutControl;
        GString* textFileName;
    
        int exitCode;
        char ownerPassword[33] = "\001";
        char userPassword[33] = "\001";
        int firstPage = 1;
        int lastPage = 0;
        GBool tableLayout = gFalse;
        double fixedPitch = 0;
        GBool physLayout = gFalse;
        GBool simpleLayout = gFalse;
        GBool simple2Layout = gFalse;
        GBool linePrinter = gFalse;
        GBool rawOrder = gFalse;
        double fixedLineSpacing = 0;
        double marginLeft = 0;
        double marginRight = 0;
        double marginTop = 0;
        double marginBottom = 0;
        GBool clipText = gFalse;
        GBool discardDiag = gFalse;
        GBool insertBOM = gFalse;
    
        exitCode = 99;
    
        // open PDF file
        if (ownerPassword[0] != '\001')
        {
            ownerPW = new GString(ownerPassword);
        }
        else
        {
            ownerPW = NULL;
        }
        if (userPassword[0] != '\001')
        {
            userPW = new GString(userPassword);
        }
        else
        {
            userPW = NULL;
        }
        PDFDoc* doc = new PDFDoc((char*)PdfFile.c_str(), ownerPW, userPW);
        if (userPW)
        {
            delete userPW;
        }
        if (ownerPW)
        {
            delete ownerPW;
        }
        if (! doc->isOk())
        {
            exitCode = 1;
            goto err2;
        }
    
        // check for copy permission
        if (! doc->okToCopy())
        {
            error(errNotAllowed, -1, "Copying of text from this document is not allowed.");
            exitCode = 3;
            goto err2;
        }
    
        // construct text file name
        textFileName = new GString(TxtFile.c_str());
    
        // get page range
        if (firstPage < 1)
        {
            firstPage = 1;
        }
        if (lastPage < 1 || lastPage > doc->getNumPages())
        {
            lastPage = doc->getNumPages();
        }
    
        // write text file
        if (tableLayout)
        {
            textOutControl.mode = textOutTableLayout;
            textOutControl.fixedPitch = fixedPitch;
        }
        else if (physLayout)
        {
            textOutControl.mode = textOutPhysLayout;
            textOutControl.fixedPitch = fixedPitch;
        }
        else if (simpleLayout)
        {
            textOutControl.mode = textOutSimpleLayout;
        }
        else if (simple2Layout)
        {
            textOutControl.mode = textOutSimple2Layout;
        }
        else if (linePrinter)
        {
            textOutControl.mode = textOutLinePrinter;
            textOutControl.fixedPitch = fixedPitch;
            textOutControl.fixedLineSpacing = fixedLineSpacing;
        }
        else if (rawOrder)
        {
            textOutControl.mode = textOutRawOrder;
        }
        else
        {
            textOutControl.mode = textOutReadingOrder;
        }
        textOutControl.clipText = clipText;
        textOutControl.discardDiagonalText = discardDiag;
        textOutControl.insertBOM = insertBOM;
        textOutControl.marginLeft = marginLeft;
        textOutControl.marginRight = marginRight;
        textOutControl.marginTop = marginTop;
        textOutControl.marginBottom = marginBottom;
        textOut = new TextOutputDev(textFileName->getCString(), &textOutControl, gFalse, gTrue);
        if (textOut->isOk())
        {
            doc->displayPages(textOut, firstPage, lastPage, 72, 72, 0, gFalse, gTrue, gFalse);
        }
        else
        {
            delete textOut;
            exitCode = 2;
            goto err3;
        }
        delete textOut;
    
        exitCode = 0;
    
        // clean up
    err3:
        delete textFileName;
    err2:
        delete doc;
    //  uMap->decRefCnt();
    err1:
        // Do NOT delete the one and only GlobalParams instance!!!
        //delete globalParams;
    
        // check for memory leaks
        Object::memCheck(stderr);
        gMemReport(stderr);
    
        return exitCode;
    }
    

    The above code may not even compile (I modified it with a text editor, not really tested it) so please make any changes that may be required. It is quite expected that the xpdf functions do not modify the globalParams object (it's "read-only" for them) so this code has a good chance to work. Btw there is a #if MULTITHREADED directive in the GlobalParams class definition (GlobalParams.h) containing 3 mutex objects in its block. The implementation code (GlobalParams.cc) locks a mutex to access the GlobalParams members, so this may cause some threads to wait a little, although I can't tell how much (one would have to thoroughly examine the code, which is a small "project" in itself). You can try testing it.

    Of course the concerns expressed by @KJ above still apply, running many such threads in parallel could overload the system (although i'm not sure if xpdf uses multiple threads to process a single PDF, could one help please, how is it configured?), esp if you are running this on a server do not allow too many concurrently-running conversions, it may cause other processes to slow down. It may also cause I/O bottleneck (disk and/or network), so experiment with few threads initially and check how it scales up.