c++xmlc++14rapidxml

C++ RapidXml - Traversing with first_node() to modify the value of a node in an XML file


I must be going crazy here... Here is my XML file, called Original.xml:

<root>
  <metadata>Trying to change this</metadata>
  <body>
    <salad>Greek Caesar</salad>
  </body>
</root>

I am trying to modify the contents within the metadata tag.

Here is my entire piece of code that I have that WORKS:

#include <iostream>

#include <rapidxml/rapidxml_print.hpp>
#include <rapidxml/rapidxml_utils.hpp>

int main()
{
  // Open 'Original.xml' to read from
  rapidxml::file<> xmlFile("Original.xml");
  rapidxml::xml_document<> doc;
  doc.parse<0>(xmlFile.data());

  // Get to <metadata> tag
  //                                       <root>        <metadata>    ???
  rapidxml::xml_node<>* metadataNode = doc.first_node()->first_node()->first_node();
  
  // Always correctly prints: 'Trying to change this'
  std::cout << "Before: " << metadataNode->value() << std::endl;

  // Modify the contents within <metadata>
  const std::string newMetadataValue = "Did the changing";
  metadataNode->value(newMetadataValue.c_str());

  // Always correctly prints: 'Did the changing'
  std::cout << "After: " << metadataNode->value() << std::endl;

  // Save output to 'New.xml'
  std::ofstream newXmlFile("New.xml");
  newXmlFile << doc;
  newXmlFile.close();
  doc.clear();

  return 0;
}

New.xml will now look like this:

<root>
    <metadata>Did the changing</metadata>
    <body>
        <salad>Greek Caesar</salad>
    </body>
</root>

That's the desired behavior I want.

What I don't understand is why I need a third first_node() call to SAVE the information inside metadata.

If I remove the third first_node() call, which is marked by the ??? above, New.xml will keep the old <metadata> string: "Trying to change this".

Yet, in this scenario, both std::cout calls on metadataNode->value() will still correctly print the intended strings; meaning, the first one will print "Trying to change this" and the second will correctly print "Did the changing".

Why in the world do I need to use n+1 calls to first_node() to SAVE the new value at the desired node where n is the number of nodes traversed (from the root) to get to the desired node? Why is that if I have n first_node() calls, I can successfully modify the value at the desired node in RAM only?

Possible bug? On whose end?


Solution

  • In the XML tree model, text elements are nodes as well. This makes sense when you have mixed content elements: <a>some<b/>text<c/>nodes</a>.

    Basically:

    #include <rapidxml/rapidxml_print.hpp>
    #include <rapidxml/rapidxml_utils.hpp>
    
    int main() {
        rapidxml::file<> xmlFile("Original.xml");
        rapidxml::xml_document<> doc;
        doc.parse<0>(xmlFile.data());
    
        auto root      = doc.first_node();
        auto metadata  = root->first_node();
        auto text_node = metadata->first_node();
    
        text_node->value("Did the changing");
    
        std::ofstream newXmlFile("New.xml");
        newXmlFile << doc;
    }
    

    But wait, there's more

    Sadly, this is a problem unless your input has exactly the expected properties.

    Assuming this ok sample:

    char sample1[] = R"(<root><metadata>Trying to change this</metadata></root>)";
    

    If metadata element were empty, you'd crash:

    char sample2[] = R"(<root><metadata></metadata></root>)";
    char sample3[] = R"(<root><metadata/></root>)";
    

    Indeed this triggers ASAN failures:

    /home/sehe/Projects/stackoverflow/test.cpp:17:25: runtime error: member access within null pointer of type 'struct xml_node'
    /home/sehe/Projects/stackoverflow/test.cpp:17:25: runtime error: member call on null pointer of type 'struct xml_base'
    /usr/include/rapidxml/rapidxml.hpp:762:24: runtime error: member call on null pointer of type 'struct xml_base'
    /usr/include/rapidxml/rapidxml.hpp:753:21: runtime error: member access within null pointer of type 'struct xml_base'
    AddressSanitizer:DEADLYSIGNAL
    

    If there's a surprise, it will .... do surprising things!

    char sample4[] = R"(<root><metadata><surprise/></metadata></root>)";
    

    Ends up erroneously generating:

    <root>
            <metadata>
                    <surprise>changed</surprise>
            </metadata>
    </root>
    

    And that's not the end of it:

    #include <rapidxml/rapidxml_print.hpp>
    #include <rapidxml/rapidxml_utils.hpp>
    #include <iostream>
    
    namespace {
        char sample1[] = R"(<root><metadata>Trying to change this</metadata></root>)";
        char sample2[] = R"(<root><metadata><surprise/></metadata></root>)";
        char sample3[] = R"(<root><metadata>mixed<surprise/>bag</metadata></root>)";
        char sample4[] = R"(<root><metadata><![CDATA[mixed<surprise/>bag]]></metadata></root>)";
        char sample5[] = R"(<root><metadata><!-- comment please -->outloud<!-- hidden --></metadata></root>)";
        //These crash:
        //char sampleX[] = R"(<root><metadata></metadata></root>)";
        //char sampleY[] = R"(<root><metadata/></root>)";
    }
    
    int main() {
        for (char* xml : {sample1, sample2, sample3, sample4, sample5}) {
            std::cout << "\n=== " << xml << " ===\n";
            rapidxml::xml_document<> doc;
            doc.parse<0>(xml);
    
            auto root      = doc.first_node();
            auto metadata  = root->first_node();
            auto text_node = metadata->first_node();
    
            text_node->value("changed");
            print(std::cout << " --> ", doc, rapidxml::print_no_indenting);
            std::cout << "\n";
        }
    }
    

    Prints

    === <root><metadata>Trying to change this</metadata></root> ===
     --> <root><metadata>changed</metadata></root>
    
    === <root><metadata><surprise/></metadata></root> ===
     --> <root><metadata><surprise>changed</surprise></metadata></root>
    
    === <root><metadata>mixed<surprise/>bag</metadata></root> ===
     --> <root><metadata>changed<surprise/>bag</metadata></root>
    
    === <root><metadata><![CDATA[mixed<surprise/>bag]]></metadata></root> ===
     --> <root><metadata><![CDATA[changed]]></metadata></root>
    
    === <root><metadata><!-- comment please -->outloud<!-- hidden --></metadata></root> ===
     --> <root><metadata>changed</metadata></root>
    

    HOW TO GET IT ROBUST?

    PUTTING IT ALL TOGETHER:

    #include <rapidxml/rapidxml.hpp>
    #include <rapidxml/rapidxml_print.hpp>
    #include <rapidxml/rapidxml_utils.hpp>
    #include <iostream>
    
    namespace {
        std::string cases[] = {
         R"(<root><metadata>Trying to change this</metadata></root>)",
         R"(<root><metadata><surprise/></metadata></root>)",
         R"(<root><metadata>mixed<surprise/>bag</metadata></root>)",
         R"(<root><metadata><![CDATA[mixed<surprise/>bag]]></metadata></root>)",
         R"(<root><metadata><!-- comment please -->outloud<!-- hidden --></metadata></root>)",
         R"(<root>
      <metadata>Trying to change this</metadata>
      <body>
        <salad>Greek Caesar</salad>
       </body>
    </root>)",
         //These no longer crash:
         R"(<root><metadata></metadata></root>)",
         R"(<root><metadata/></root>)",
         // more edge-cases in the predecessor chain
         R"(<root></root>)",
         R"(<root><no-metadata/></root>)",
         R"(<bogus/>)",
        };
    }
    
    int main() {
        auto const fmt = rapidxml::print_no_indenting;
        for (auto& xml : cases) {
            std::cout << "Input: " << xml << "\n";
    
            rapidxml::xml_document<> doc;
            doc.parse<0>(xml.data());
    
            if (auto root = doc.first_node()) {
                if (root->name() == std::string("root")) {
                    if (auto* old_meta = root->first_node()) {
                        if (old_meta->name() == std::string("metadata")) {
                            root->remove_first_node();
                        } else {
                            std::cout << "WARNING: Not removing '" << old_meta->name() << "' element where 'metadata' expected\n";
                        }
                    }
    
                    auto newmeta = doc.allocate_node(rapidxml::node_element, "metadata", "changed");
                    root->prepend_node(newmeta);
                } else {
                    std::cout << "WARNING: '" << root->name() << "' found where 'root' expected\n";
                }
            }
            
            print(std::cout << "Output: ", doc, fmt);
            std::cout << "\n--\n";
        }
    }
    

    Prints

    Input: <root><metadata>Trying to change this</metadata></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root><metadata><surprise/></metadata></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root><metadata>mixed<surprise/>bag</metadata></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root><metadata><![CDATA[mixed<surprise/>bag]]></metadata></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root><metadata><!-- comment please -->outloud<!-- hidden --></metadata></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root>
      <metadata>Trying to change this</metadata>
      <body>
        <salad>Greek Caesar</salad>
       </body>
    </root> ===
    Output: <root><metadata>changed</metadata><body><salad>Greek Caesar</salad></body></root>
    --
    Input: <root><metadata></metadata></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root><metadata/></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root></root> ===
    Output: <root><metadata>changed</metadata></root>
    --
    Input: <root><no-metadata/></root> ===
    WARNING: Not removing 'no-metadata' element where 'metadata' expected
    Output: <root><metadata>changed</metadata><no-metadata/></root>
    --
    Input: <bogus/> ===
    WARNING: 'bogus' found where 'root' expected
    Output: <bogus/>
    --
    

    SUMMARY

    XML is extensible. It's Markup. It's Language. It's not simple :)