I am trying to use Cheerio and Node.js to extract text from an interesting bit of HTML.
Let's say I have the following HTML:
<p>
<span class="sectionno" id="s1">1</span>
Do you see that shelf?
<span class="endsection"></span>
<span class="sectionno" id="s2">2</span>The shelf is hanging
</p>
<p>on the wall</p>
<p>beside the clock.</p>
<h3>Title Here</h3>
<span class="endsection"></span>
<p>
<span class="sectionno" id="s3">3</span>The clock
</p>
<p>was ticking slowly</p>
<p>telling time<span class="endsection"></span></p>
I want to be able to extract the following data, getting the text between each pair of span.sectionno and span.endsection:
[
  {
    no: 1,
    text: "Do you see that shelf?",
  },
  {
    no: 2,
    text: "The shelf is hanging on the wall, beside the clock.",
  },
  {
    no: 3,
    text: "The clock was ticking slowly telling time",
  },
]
Notice that I want to ignore any text in headings. I tried things like this, but it just gives me the numbers at the beginning of each section:
const $ = cheerio.load(html);
const sections = [];
$("span.sectionno").each((_, el) => {
  const sectionNo = parseInt($(el).text());
  const text = $(el).nextUntil("span.endsection").addBack().text();
  sections.push({ no: sectionNo, text: text.trim() });
});
console.log(sections);
// [ { no: 1, text: '1' }, { no: 2, text: '2' }, { no: 3, text: '3' } ]
Because of the strange setup of the HTML I have been unable to successfully do this with Cheerio.
Any good generic approach should consist of three main steps.
First, one has to parse a document from the provided markup string, e.g. with ...
const doc = new DOMParser()
  .parseFromString(markup, 'text/html');
Then one needs to query all element nodes that carry the sectionno class, e.g. with ...
const sectionStartNodeList = doc.body
  .querySelectorAll('.sectionno');
The main task, aggregating a text-content item for each available section-start node, is achieved by a simple tree-walking process.
For each such entry point one starts by extracting the item count (no) of the text-item object that is to be created and returned. The item's text property value is then aggregated by proceeding with the nextSibling of the currently processed node (either a text node or an element node). In case there is no next sibling, and no element node that marks a section's end has been reached yet, one has to switch to the next sibling of this last node's parentNode. That's all that is needed for a successful tree walk.
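The walk described above can be sketched without any DOM library. What follows is a minimal sketch, assuming a hand-built tree of plain objects that mimic only nodeType, nextSibling, parentNode and childNodes; the helpers makeElement, makeText, textOf and walkSection are hypothetical names introduced for illustration and do not belong to any library.

```javascript
const ELEMENT = 1;
const TEXT = 3;

// Hypothetical helper: creates an element-like object and wires up
// `parentNode` / `nextSibling` for the children passed in.
function makeElement(tagName, className, children = []) {
  const node = {
    nodeType: ELEMENT, tagName, className,
    childNodes: children, parentNode: null, nextSibling: null,
  };
  children.forEach((child, idx) => {
    child.parentNode = node;
    child.nextSibling = children[idx + 1] ?? null;
  });
  return node;
}

// Hypothetical helper: creates a text-node-like object.
function makeText(value) {
  return {
    nodeType: TEXT, nodeValue: value,
    childNodes: [], parentNode: null, nextSibling: null,
  };
}

// Stand-in for `textContent`: concatenates all descendant text.
function textOf(node) {
  return node.nodeType === TEXT
    ? node.nodeValue
    : node.childNodes.map(textOf).join('');
}

// Walks from a section-start node to the next end marker, preferring the
// next sibling and otherwise climbing to the parent's next sibling.
function walkSection(startNode) {
  const parts = [];
  let node = startNode;
  while (
    (node = node.nextSibling ?? node.parentNode?.nextSibling ?? null) &&
    node.className !== 'endsection'
  ) {
    const isHeading =
      node.nodeType === ELEMENT && /^h[1-6]$/i.test(node.tagName);
    const value = isHeading ? '' : textOf(node).trim();
    if (value) {
      parts.push(value);
    }
  }
  return { no: Number(textOf(startNode)), text: parts.join(' ') };
}

// Mirrors section 2 of the question's markup.
const start = makeElement('span', 'sectionno', [makeText('2')]);
const root = makeElement('div', '', [
  makeElement('p', '', [start, makeText('The shelf is hanging')]),
  makeElement('p', '', [makeText('on the wall')]),
  makeElement('p', '', [makeText('beside the clock.')]),
  makeElement('h3', '', [makeText('Title Here')]),
  makeElement('span', 'endsection'),
]);

const result = walkSection(start);
console.log(result);
// { no: 2, text: 'The shelf is hanging on the wall beside the clock.' }
```

Note that, like the approach it illustrates, this sketch only checks the nodes it steps onto for the end marker; an end marker nested inside a paragraph is not detected while that paragraph's text is collected as a whole.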
In case the function described above has been named extractSectionTextContent, it can be applied directly via a map task which iterates the array form of the previously queried node list ...
const sectionContentList = [...sectionStartNodeList]
  .map(extractSectionTextContent);
... example code ...
const markup = `
<p>
<span class="sectionno" id="s1">1</span>
Do you see that shelf?
<span class="endsection"></span>
<span class="sectionno" id="s2">2</span>The shelf is hanging
</p>
<p>on the wall</p>
<p>beside the clock.</p>
<h3>Title Here</h3>
<span class="endsection"></span>
<p>
<span class="sectionno" id="s3">3</span>The clock
</p>
<p>was ticking slowly</p>
<p>telling time<span class="endsection"></span></p>
`;
const docBody = new DOMParser()
  .parseFromString(markup, 'text/html')
  .body;
const sectionStartNodeList = docBody
  .querySelectorAll('.sectionno');
console.log({ sectionStartNodeList: [...sectionStartNodeList] });
const sectionContentList = [...sectionStartNodeList]
  .map(extractSectionTextContent);
console.log({ sectionContentList });
<script>
function extractSectionTextContent(node) {
  const contentList = [];
  // OP expects a numeric `no`, hence the explicit parse.
  const textItemCount = Number.parseInt(node.textContent.trim(), 10);

  while (
    // try the next sibling first, then climb to the parent's next sibling.
    (node = node.nextSibling || node.parentNode?.nextSibling) &&
    !node.classList?.contains('endsection')
  ) {
    // declared per iteration so that a skipped node (e.g. a heading)
    // cannot push the previous node's text a second time.
    let textValue;

    if (node.nodeType === Node.TEXT_NODE) {
      textValue = node.nodeValue.trim();
    } else if (
      (node.nodeType === Node.ELEMENT_NODE) &&
      // OP ... "Notice that I want to ignore any text in headings."
      !/^h[1-6]$/.test(node.tagName.toLowerCase())
    ) {
      textValue = node.textContent.trim();
    }
    if (textValue) {
      contentList.push(textValue);
    }
  }
  return {
    no: textItemCount,
    text: contentList.join(' '),
  };
}
</script>
Edit ... regarding the next quoted follow-up comments after having provided the above solution ...
This is nice! But this runs in the browser, I am trying to do this in node and it doesn't seem to quite work with using jsdom instead? – Adam D
@AdamD ... everything provided above runs in node.js too. What you have to look for is a DOMParser-like node package/module, or make use of e.g. the jsdom package. – Peter Seliger
The jsdom library fails at traversing a DOM-like model as it is required for any c/lean solution to the OP's problem. But ershov-konst's dom-parser package provides some basic dom-walking capability.
Thus the next provided code can be run in a node.js environment.
The first introduced approach can be kept entirely. Just some implementation details have to be changed slightly in order to reflect the model differences which are introduced by the dom-parser library.
This library, for instance, does not support a DOM node's nextSibling property. Thus one has to implement and utilize a custom getNextSibling function that works upon the childNodes array of any node's parentNode, both of which are properties supported by dom-parser.
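The sibling lookup described above can be tried in isolation. This is a minimal sketch, assuming nodes are plain objects that expose only parentNode and a childNodes array; the three-node tree (parent, first, second) is made up for illustration, and the guard against a node missing from its parent's list is an added assumption rather than something the library requires.

```javascript
// Sibling lookup for trees that expose only `parentNode` and `childNodes`.
function getNextSibling(node) {
  const siblings = node?.parentNode?.childNodes ?? [];
  const idx = siblings.indexOf(node);
  // `indexOf` yields -1 for an unknown node; without the guard the lookup
  // would return `siblings[0]`, which is not a next sibling at all.
  return idx === -1 ? null : siblings[idx + 1] ?? null;
}

// Hypothetical three-node tree, for illustration only.
const parent = { name: 'parent', parentNode: null, childNodes: [] };
const first = { name: 'first', parentNode: parent, childNodes: [] };
const second = { name: 'second', parentNode: parent, childNodes: [] };
parent.childNodes.push(first, second);

const siblingNames = [first, second, parent].map(
  (node) => getNextSibling(node)?.name ?? null
);
console.log(siblingNames); // [ 'second', null, null ]
```

Returning null for both "last child" and "no parent" keeps the helper a drop-in stand-in for nextSibling inside a loop condition.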
... example code, capable of being executed within a node.js environment ...
const markup = `
<p>
<span class="sectionno" id="s1">1</span>
Do you see that shelf?
<span class="endsection"></span>
<span class="sectionno" id="s2">2</span>The shelf is hanging
</p>
<p>on the wall</p>
<p>beside the clock.</p>
<h3>Title Here</h3>
<span class="endsection"></span>
<p>
<span class="sectionno" id="s3">3</span>The clock
</p>
<p>was ticking slowly</p>
<p>telling time<span class="endsection"></span></p>
`;
function main(markup) {
  const domParserRoot = domParser
    .parseFromString(`<div>${ markup }</div>`);
  const sectionStartNodeList = domParserRoot
    .getElementsByClassName('sectionno');
  console.log({ sectionStartNodeList });
  const sectionContentList = [...sectionStartNodeList]
    .map(extractSectionTextContent);
  console.log({ sectionContentList });
}
document
  .addEventListener('DOMContentLoaded', () => main(markup));
<script type="module">
import * as domParser from 'https://cdn.jsdelivr.net/npm/dom-parser@1.1.5/+esm';
window.domParser = domParser;
</script>
<script>
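function getNextSibling(node) {
  const siblingNodes = node.parentNode?.childNodes ?? [];
  const idx = siblingNodes.indexOf(node);

  // `indexOf` yields -1 for a node missing from its parent's list; without
  // this guard the lookup would wrongly return the first sibling.
  return (idx === -1) ? null : (siblingNodes[idx + 1] ?? null);
}
function extractSectionTextContent(node) {
  const contentList = [];
  // OP expects a numeric `no`, hence the explicit parse.
  const textItemCount = Number.parseInt(node.textContent.trim(), 10);

  let classAttr;

  while (
    // try the next sibling first, then climb to the parent's next sibling.
    (node = getNextSibling(node) || (node.parentNode && getNextSibling(node.parentNode))) &&
    (classAttr = node.attributes.find(({ name }) => name === 'class') ?? {}) &&
    !/\bendsection\b/.test(classAttr.value ?? '')
  ) {
    // declared per iteration so that a skipped node (e.g. a heading)
    // cannot push the previous node's text a second time.
    let textValue;

    if (node.nodeType === 3) {
      textValue = node.text.trim();
    } else if (
      (node.nodeType === 1) &&
      // OP ... "Notice that I want to ignore any text in headings."
      !/^h[1-6]$/i.test(node.nodeName)
    ) {
      textValue = node.textContent.trim();
    }
    if (textValue) {
      contentList.push(textValue);
    }
  }
  return {
    no: textItemCount,
    text: contentList.join(' '),
  };
}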
</script>