c++ xml msxml6

C++ read XML files in pieces


I'm doing an exercise with the MSXML6 library in Visual C++ to reduce my dependence on interpreted languages like Python for analysing big files. I was following the tutorial on MSDN, but when I substitute a much larger XML file (upwards of 300 MB), the program reports that it was unable to locate the file, and the subsequent variant is NULL.

Tutorial: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms767609(v%3dvs.85)

    HRESULT hr = S_OK;
    IXMLDOMDocument *pXMLDom = NULL;
    IXMLDOMNodeList *pNodes = NULL;
    IXMLDOMNode *pNode = NULL;

    BSTR bstrQuery1 = NULL;
    BSTR bstrQuery2 = NULL;
    BSTR bstrNodeName = NULL;
    BSTR bstrNodeValue = NULL;
    DOMNodeType DOMType;
    VARIANT varNodeValue;
    VARIANT_BOOL varStatus;
    VARIANT varFileName;
    VariantInit(&varFileName);

    CHK_HR(CreateAndInitDOM(&pXMLDom));

    CHK_HR(VariantFromString(L"TestDoc.xml", varFileName));
    CHK_HR(pXMLDom->load(varFileName, &varStatus));
    if (varStatus != VARIANT_TRUE)
    {
        CHK_HR(ReportParseError(pXMLDom, "Failed to load DOM from TestDoc.xml"));
        initSuccessful = false;
    }
    else
    {
        //Assigns the DOM object as a member variable to be used in other methods
        pXMLDomClassVar = pXMLDom;
        initSuccessful = true;
    }

I'd really appreciate some help with this.


Solution

  • Keep in mind that an XML DOM is just an in-memory database, built on the fly by fully parsing the XML file. Handling big XML files through the DOM is very bad practice because of the enormous memory consumption (the content itself, indexes, cross-links, etc.) and the resulting poor performance. Even 10 MB of XML in a DOM is noticeable in terms of performance, and you're going for 30 times that!

    Instead, for big XML files you should use the "SAX parsing" approach, which can operate even on endless XML streams. It's up to you to store the XML excerpts you're interested in and ignore the rest.