
How to extract text from between any given pair of spans?


I am trying to use Cheerio and Node.js to extract text from an interesting bit of HTML.

Let's say I have the following HTML:

<p>
  <span class="sectionno" id="s1">1</span>
  Do you see that shelf?
  <span class="endsection"></span>
  <span class="sectionno" id="s2">2</span>The shelf is hanging
</p>
<p>on the wall</p>
<p>beside the clock.</p>
<h3>Title Here</h3>
<span class="endsection"></span>
<p>
  <span class="sectionno" id="s3">3</span>The clock
</p>
<p>was ticking slowly</p>
<p>telling time<span class="endsection"></span></p>

I want to be able to extract the following data, getting the text between each pair of span.sectionno and span.endsection:

[
  {
    no: 1,
    text: "Do you see that shelf?",
  },
  {
    no: 2,
    text: "The shelf is hanging on the wall, beside the clock.",
  },
  {
    no: 3,
    text: "The clock was ticking slowly telling time",
  },
]

Notice that I want to ignore any text in headings. I tried things like this, but it just gives me the numbers at the beginning of each section:

const $ = cheerio.load(html);
const sections = [];

$("span.sectionno").each((_, el) => {
  const sectionNo = parseInt($(el).text());
  const text = $(el).nextUntil("span.endsection").addBack().text();
  sections.push({ no: sectionNo, text: text.trim() });
});

console.log(sections);
// [ { no: 1, text: '1' }, { no: 2, text: '2' }, { no: 3, text: '3' } ]

Because of the strange setup of the HTML I have been unable to successfully do this with Cheerio.


Solution

  • A good generic approach consists of three main steps.

    First, one has to parse a document from the provided markup string, e.g. ...

    const doc = new DOMParser()
      .parseFromString(markup, 'text/html');
    

    Then one needs to query all element nodes that carry the sectionno class, e.g. ...

    const sectionStartNodeList = doc.body
      .querySelectorAll('.sectionno');
    

    The main task, aggregating a text-content item for each available section-start node, is achieved by a simple tree walk.

    For each such entry point, one starts by extracting the item number (no) of the text-item object that is going to be created and returned. The item's text value is then aggregated by repeatedly moving on to the nextSibling of the currently processed node (either a text node or an element node). If there is no next sibling, one has to switch to the next sibling of the current node's parentNode. The walk stops as soon as it reaches an element node that marks a section's end. That is all that is needed for a successful tree walk.
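
    The walk described above can be tried out without a browser or any DOM package. What follows is only a minimal sketch: the plain objects are hypothetical stand-ins for DOM nodes, and el, txt, link, textOf and collectSection are illustrative helper names, not part of any library. The tree mirrors section 2 of the question's markup.

```javascript
// Hypothetical stand-ins for DOM nodes, for illustration only.
const el = (tagName, className, ...childNodes) =>
  ({ nodeType: 1, tagName, className, childNodes });
const txt = nodeValue => ({ nodeType: 3, nodeValue, childNodes: [] });

// Wires up `parentNode` and `nextSibling` the way a DOM tree provides them.
function link(node, parentNode = null) {
  node.parentNode = parentNode;
  node.nextSibling = null;
  node.childNodes.forEach((child, idx, siblings) => {
    link(child, node);
    child.nextSibling = siblings[idx + 1] ?? null;
  });
  return node;
}

// Concatenated text of a node and all of its descendants.
function textOf(node) {
  return node.nodeType === 3
    ? node.nodeValue
    : node.childNodes.map(textOf).join('');
}

// Walks from a section-start node to the next end marker.
function collectSection(startNode) {
  const parts = [];
  let node = startNode;

  while (
    // next sibling, else the parent's next sibling (one level up)
    (node = node.nextSibling || node.parentNode?.nextSibling) &&
    node.className !== 'endsection'
  ) {
    // headings contribute nothing, every other node contributes its text
    const isHeading = node.nodeType === 1 && /^h[1-6]$/.test(node.tagName);
    const text = isHeading ? '' : textOf(node).trim();

    if (text) {
      parts.push(text);
    }
  }
  return { no: textOf(startNode).trim(), text: parts.join(' ') };
}

const start = el('span', 'sectionno', txt('2'));

link(el('div', '',
  el('p', '', start, txt('The shelf is hanging')),
  el('p', '', txt('on the wall')),
  el('p', '', txt('beside the clock.')),
  el('h3', '', txt('Title Here')),
  el('span', 'endsection'),
));

const section = collectSection(start);
console.log(section);
```

    The sketch climbs only one level and treats each following element as a whole, so an end marker nested inside a later element is not detected as a stop.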

    If the function described above is named extractSectionTextContent, it can be passed directly to a map task which iterates over the array form of the previously queried node list ...

    const sectionContentList = [...sectionStartNodeList]
      .map(extractSectionTextContent);
    

    ... example code ...

    const markup = `
      <p>
        <span class="sectionno" id="s1">1</span>
        Do you see that shelf?
        <span class="endsection"></span>
        <span class="sectionno" id="s2">2</span>The shelf is hanging
      </p>
      <p>on the wall</p>
      <p>beside the clock.</p>
      <h3>Title Here</h3>
      <span class="endsection"></span>
      <p>
        <span class="sectionno" id="s3">3</span>The clock
      </p>
      <p>was ticking slowly</p>
      <p>telling time<span class="endsection"></span></p>
    `;
    const docBody = new DOMParser()
      .parseFromString(markup, 'text/html')
      .body;
    
    const sectionStartNodeList = docBody
      .querySelectorAll('.sectionno');
    
    console.log({ sectionStartNodeList: [...sectionStartNodeList] });
    
    const sectionContentList = [...sectionStartNodeList]
      .map(extractSectionTextContent);
    
    console.log({ sectionContentList });
    <script>
    function extractSectionTextContent(node) {
    
      const contentList = [];
      const textItemCount = node.textContent.trim();
    
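      while (
        (node = node.nextSibling || node.parentNode.nextSibling) &&
        !node.classList?.contains('endsection')
      ) {
        // reset per node so that a skipped node (e.g. a heading)
        // cannot re-push the previous node's text.
        let textValue = '';
    
        if (node.nodeType === Node.TEXT_NODE) {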
    
          textValue = node.nodeValue.trim();
    
        } else if (
          (node.nodeType === Node.ELEMENT_NODE) &&
    
          // OP ... "Notice that I want to ignore any text in headings."
          !/^h[1-6]$/.test(node.tagName.toLowerCase())
        ) {
    
          textValue = node.textContent.trim();
        }
        if (textValue) {
          contentList.push(textValue);
        }
      }
    
      return {
        no: textItemCount,
        text: contentList.join(' '), 
      };
    }
    </script>

    Edit ... regarding the follow-up comments quoted next, which were posted after the above solution had been provided ...

    This is nice! But this runs in the browser, I am trying to do this in node and it doesn't seem to quite work with using jsdom instead? – Adam D

    @AdamD ... everything provided above runs in node.js too. What you have to look for is a DOMParser like node package/module or make use of e.g. the jsdom package. – Peter Seliger

    The jsdom library appeared to fail at traversing a DOM-like model the way any clean, lean solution to the OP's problem requires. But ershov-konst's dom-parser package provides some basic DOM-walking capability.

    Thus, the code provided next can be run in a Node.js environment.

    The approach introduced first can be kept entirely. Only some implementation details have to change slightly in order to reflect the model differences introduced by the dom-parser library.

    This library, for instance, does not support a DOM node's nextSibling property. One therefore has to implement and use a custom getNextSibling function that works on the childNodes array of any node's parentNode; both are properties that dom-parser supports.
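
    The sibling lookup can be checked in isolation before wiring it into the walk. The following is a minimal sketch under the assumption that nodes expose nothing but parentNode and childNodes; the plain objects and their name field are made up for illustration and are not dom-parser objects. Unlike a bare "index + 1" lookup, this variant guards explicitly against a node that is not found among its parent's children.

```javascript
// Plain objects stand in for parser nodes; only `parentNode` and
// `childNodes` are assumed to exist, `nextSibling` is not.
function getNextSibling(node) {
  const siblingNodes = node?.parentNode?.childNodes ?? [];
  const index = siblingNodes.indexOf(node);

  // `indexOf` returns -1 for an unknown node; guard so that this case
  // does not fall through to index 0 (the first child).
  return index === -1 ? null : (siblingNodes[index + 1] ?? null);
}

const parent = { name: 'p', parentNode: null, childNodes: [] };
const first = { name: 'span', parentNode: parent };
const last = { name: '#text', parentNode: parent };
parent.childNodes.push(first, last);

const detached = { name: 'span', parentNode: null };

console.log(getNextSibling(first) === last); // true
console.log(getNextSibling(last));           // null
console.log(getNextSibling(detached));       // null
```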

    ... example code, capable of being executed within a node.js environment ...

    const markup = `
      <p>
        <span class="sectionno" id="s1">1</span>
        Do you see that shelf?
        <span class="endsection"></span>
        <span class="sectionno" id="s2">2</span>The shelf is hanging
      </p>
      <p>on the wall</p>
      <p>beside the clock.</p>
      <h3>Title Here</h3>
      <span class="endsection"></span>
      <p>
        <span class="sectionno" id="s3">3</span>The clock
      </p>
      <p>was ticking slowly</p>
      <p>telling time<span class="endsection"></span></p>
    `;
    function main(markup) {
    
      const domParserRoot = domParser
        .parseFromString(`<div>${ markup }</div>`);
    
      const sectionStartNodeList = domParserRoot
        .getElementsByClassName('sectionno');
    
      console.log({ sectionStartNodeList });
    
      const sectionContentList = [...sectionStartNodeList]
        .map(extractSectionTextContent);
    
      console.log({ sectionContentList });
    }
    document
      .addEventListener('DOMContentLoaded', () => main(markup));
    <script type="module">
      import * as domParser from 'https://cdn.jsdelivr.net/npm/dom-parser@1.1.5/+esm';
      
      window.domParser = domParser;
    </script>
    
    <script>
    function getNextSibling(node) {
      const siblingNodes = node.parentNode?.childNodes ?? [];
    
      return siblingNodes
        .at(siblingNodes.indexOf(node) + 1) ?? null;
    }
    function extractSectionTextContent(node) {
    
      const contentList = [];
      const textItemCount = node.textContent.trim();
    
      let classAttr;
    
      while (
        (node = getNextSibling(node) || getNextSibling(node.parentNode)) &&
        (classAttr = node.attributes.find(({ name }) => name === 'class') ?? {}) &&
        !/\bendsection\b/.test(classAttr.value ?? '')
      ) {
        // reset per node so that a skipped node (e.g. a heading)
        // cannot re-push the previous node's text.
        let textValue = '';
    
        if (node.nodeType === 3) {
    
          textValue = node.text.trim();
    
        } else if (
          (node.nodeType === 1) &&
    
          // OP ... "Notice that I want to ignore any text in headings."
          !/^h[1-6]$/.test(node.nodeName)
        ) {
    
          textValue = node.textContent.trim();
        }
        if (textValue) {
          contentList.push(textValue);
        }
      }
    
      return {
        no: textItemCount,
        text: contentList.join(' '), 
      };
    }
    </script>