[SOLVED] Parsing html text out of a content node using libxml++ returns empty output

I modified the DOM parsing example found here: https://libxmlplusplus.github.io/libxmlplusplus/manual/html/chapter-parsers.html#sect-dom-parser

The code:

#include <libxml++/libxml++.h>
#include <iostream>
#include <cstdlib>

int main(int argc, char* argv[])
{
  std::string filepath = "theHtml.html";

  try
  {
    xmlpp::DomParser parser;
    parser.parse_file(filepath);

    const auto pNode = parser.get_document()->get_root_node();
    for(const auto& child : pNode->get_children())
    {
      const auto nodeText = dynamic_cast<const xmlpp::TextNode*>(child);

      if(nodeText)
      {
        std::cout << "Text Node" << std::endl;
        std::cout << nodeText->get_content();
      }
    }
  }
  catch(const std::exception& ex)
  {
    std::cerr << "Exception caught: " << ex.what() << std::endl;
    return EXIT_FAILURE;
  }

  return EXIT_SUCCESS;
}

The HTML:

<!DOCTYPE html>
<html lang="en">
<head>
<title>The title</title>
</head>

<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

And the output:

Text Node

Text Node


Text Node

21:05:13: qtProject exited with code 0

It seems that get_content() isn't working correctly in my modified code: it returns empty output where there should be text. The unmodified example compiles and prints the text found in the HTML file. Documentation: https://fossies.org/dox/libxml++-5.0.2/classxmlpp_1_1ContentNode.html

Edit: This seems to work, though:

#include <libxml++/libxml++.h>
#include <iostream>
#include <cstdlib>

void print_node(const xmlpp::Node* node)
{
  const auto nodeText = dynamic_cast<const xmlpp::TextNode*>(node);

  if(nodeText && nodeText->is_white_space())
    return;

  if(nodeText)
  {
    std::cout << "Text Node" << std::endl;
    std::cout << "text = \"" << nodeText->get_content() << "\"" << std::endl;
  }

  //Recurse through child nodes:
  for(const auto& child : node->get_children())
  {
    print_node(child);
  }
}

int main(int argc, char* argv[])
{
  std::string filepath;
  filepath = "theHtml.html";

  try
  {
    xmlpp::DomParser parser;
    parser.parse_file(filepath);

    if(parser)
    {
      //Walk the tree:
      const auto pNode = parser.get_document()->get_root_node();
      print_node(pNode);
    }
  }
  catch(const std::exception& ex)
  {
    std::cerr << "Exception caught: " << ex.what() << std::endl;
    return EXIT_FAILURE;
  }

  return EXIT_SUCCESS;
}

Output:

Text Node
text = "The title"
Text Node
text = "My First Heading"
Text Node
text = "My first paragraph."
22:56:58: qtProject exited with code 0

Solution

Your non-working version does find text nodes, but the ones it finds are just the whitespace between the top-level elements of your HTML (the line breaks around <head> and <body>). It only looks at the direct children of the root node, so it never descends far enough into the tree to reach the 'real' text nodes. get_content() is working; the content it returns is simply all spaces, tabs, and newlines.

You could change your code to output this

// output text with delimiters
std::cout << '|' << nodeText->get_content() << '|' << std::endl;

to see exactly what text your code is finding.
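
If the delimiters alone are not clear enough (newlines still just print as line breaks), a small helper can spell the whitespace out. This is only a sketch: show_whitespace is a hypothetical name, not part of libxml++, and it assumes the value returned by get_content() converts to std::string.

```cpp
#include <string>

// Hypothetical helper (not part of libxml++): returns a copy of `text`
// in which newlines, tabs, and carriage returns are replaced by their
// escape sequences. Ordinary spaces are left unchanged.
std::string show_whitespace(const std::string& text)
{
  std::string out;
  for (const char c : text)
  {
    switch (c)
    {
      case '\n': out += "\\n"; break;
      case '\t': out += "\\t"; break;
      case '\r': out += "\\r"; break;
      default:   out += c;     break;
    }
  }
  return out;
}
```

You would then print something like std::cout << '|' << show_whitespace(nodeText->get_content()) << '|' << std::endl; and a whitespace-only node shows up as a run of \n and \t between the bars instead of as blank lines.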

Your second, working version is recursive, so it scans the entire tree and finds all the text. It also skips whitespace-only nodes via is_white_space(), which is why the blank entries no longer appear.
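
To make the difference concrete without needing libxml++ installed, here is a rough sketch using a toy tree. ToyNode, direct_text, all_text, and make_sample are all made-up names for illustration and are not the libxml++ API; the sample tree is my approximation of how your HTML file would be laid out, whitespace included.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a DOM node (hypothetical; not the libxml++ API).
// An empty `name` marks a text node whose content is stored in `text`.
struct ToyNode
{
  std::string name;
  std::string text;
  std::vector<ToyNode> children;
};

ToyNode text_node(std::string s) { return ToyNode{"", std::move(s), {}}; }

ToyNode element(std::string name, std::vector<ToyNode> kids)
{
  return ToyNode{std::move(name), "", std::move(kids)};
}

bool is_text(const ToyNode& n) { return n.name.empty(); }

// True if the string is empty or consists only of whitespace characters.
bool is_blank(const std::string& s)
{
  return std::all_of(s.begin(), s.end(),
                     [](unsigned char c) { return std::isspace(c) != 0; });
}

// Shallow scan: looks only at direct children, like the first program.
std::vector<std::string> direct_text(const ToyNode& node)
{
  std::vector<std::string> found;
  for (const auto& child : node.children)
    if (is_text(child))
      found.push_back(child.text);
  return found;
}

// Recursive helper: visits the node and every descendant,
// collecting text nodes that are not whitespace-only.
void collect(const ToyNode& node, std::vector<std::string>& found)
{
  if (is_text(node) && !is_blank(node.text))
    found.push_back(node.text);
  for (const auto& child : node.children)
    collect(child, found);
}

// Recursive scan: walks the whole tree, like the second program.
std::vector<std::string> all_text(const ToyNode& root)
{
  std::vector<std::string> found;
  collect(root, found);
  return found;
}

// Approximation of the HTML from the question, with the line breaks
// between elements kept as whitespace-only text nodes.
ToyNode make_sample()
{
  ToyNode head = element("head", {text_node("\n"),
      element("title", {text_node("The title")}), text_node("\n")});
  ToyNode body = element("body", {text_node("\n\n"),
      element("h1", {text_node("My First Heading")}), text_node("\n"),
      element("p", {text_node("My first paragraph.")}), text_node("\n\n")});
  return element("html", {text_node("\n"), std::move(head),
      text_node("\n\n"), std::move(body), text_node("\n")});
}
```

On this sample, direct_text(make_sample()) yields three strings that are all whitespace, which lines up with the three empty "Text Node" entries in the first output, while all_text(make_sample()) yields the title, the heading, and the paragraph text.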