I'm trying to collect data with x-ray from a page that is structured like:
<h1>Page title</h1>
<article>
<h2 id="first">Title 1</h2>
<h3>Subtitle 1</h3>
<ul>
<li>Element 1
<li>Element 2
<li>Element 3
</ul>
<h2 id="second">Title 2</h2>
<h3>Subtitle 2</h3>
<h2 id="third">Title 3</h2>
<h3>Subtitle 3</h3>
<ul>
<li>Element 1
<li>Element 2
<li>Element 3
</ul>
</article>
The article is split into sections by <h2> elements. Each section contains a subtitle and may contain a list of items. My goal is to get an object with the structure:
type Result = {
pageTitle: string,
sections: [{ subtitle?: string, elements?: string[] }],
}
From that example structure I expect the output:
{
pageTitle: "Page title",
sections: [
{
subtitle: "Subtitle 1",
elements: ["Element 1", "Element 2", "Element 3"]
},
{
subtitle: "Subtitle 2",
elements: [] //or any falsy value
},
{
subtitle: "Subtitle 3",
elements: ["Element 1", "Element 2", "Element 3"]
}
]
}
I've tried:
xray(url, {
pageTitle: "h1 | trim", //where trim is defined filter
sections: xray("article", [{
subtitle: "h3",
elements: ['h3 ~ ul li']
}])
})
But I've figured out that it doesn't work as expected because there is only one article tag on the page, and [] indicates that xray will iterate over whatever the selector (article in my case) returns.
I've also tried:
xray(url, {
pageTitle: "h1 | trim", //where trim is defined filter
sections: xray("h2", [{
subtitle: "h3",
elements: ['h3 ~ ul li']
}])
})
This returns 0 results, probably because xray("h2", /* other code */) "scopes" the selection to only h2 and nothing else, and my h2 elements don't contain nested elements.
So is there a way to get an array of objects from a non-nested HTML structure?
The x-ray library doesn't provide an easy way to capture sibling elements within its syntax. It primarily works with parent-child relationships, and the structure you're trying to scrape doesn't conform to that pattern.
Ideally, a library like Puppeteer would be more suitable, but x-ray combined with jsdom can handle it too.
The solution is to pre-process the HTML to encapsulate each section within a separate container, then scrape that new structure with x-ray.
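Steps:
1. Parse the HTML with jsdom.
2. Wrap each <h2> and the siblings that follow it (up to the next <h2>) in its own <div>.
3. Serialize the modified DOM and run x-ray on it, using the new <div> as the scope for each section.
Code: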
const { JSDOM } = require("jsdom");
const Xray = require("x-ray");

// "h1 | trim" only works if a trim filter is registered on the instance.
const xray = Xray({
  filters: {
    trim: (value) => (typeof value === "string" ? value.trim() : value),
  },
});

const html = "..."; // Replace with your HTML string
const dom = new JSDOM(html);
const document = dom.window.document;

// Wrap every <h2> and the siblings that follow it (up to the next <h2>)
// in a <div class="section">.
document.querySelectorAll("h2").forEach((h2) => {
  const section = document.createElement("div");
  section.className = "section";
  h2.parentNode.insertBefore(section, h2);

  let node = h2;
  while (node) {
    // Read the next sibling before moving the node: once it is inside
    // the wrapper, nextElementSibling no longer points at the old neighbour.
    const next = node.nextElementSibling;
    section.appendChild(node);
    node = next && next.tagName !== "H2" ? next : null;
  }
});

const processedHtml = document.body.innerHTML;

xray(processedHtml, {
  pageTitle: "h1 | trim",
  // Scope on the wrapper class so unrelated <div> elements are not matched.
  sections: xray("div.section", [{
    subtitle: "h3",
    elements: ["ul li"],
  }]),
})((err, result) => {
  if (err) throw err;
  console.log(result);
});
This approach relies heavily on DOM manipulation, which may not be ideal.
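If mutating the DOM is a concern, one alternative is to skip the wrapping step and build the sections directly by walking the children of the article in document order. The following is only a minimal sketch and not part of x-ray: collectSections is a hypothetical helper, and it assumes that the h2, h3 and ul elements are direct children of the container, as in the example markup.

```javascript
// Hypothetical helper: groups the flat children of a container into sections.
// `container` is assumed to be a DOM element (or any object that exposes
// `children`, `tagName` and `textContent` in the same way).
function collectSections(container) {
  const sections = [];
  let current = null;

  for (const node of Array.from(container.children)) {
    if (node.tagName === "H2") {
      // Each <h2> opens a new section; the siblings after it belong to it.
      current = { elements: [] };
      sections.push(current);
    } else if (!current) {
      // Skip anything that appears before the first <h2>.
      continue;
    } else if (node.tagName === "H3") {
      current.subtitle = node.textContent.trim();
    } else if (node.tagName === "UL") {
      for (const item of Array.from(node.children)) {
        if (item.tagName === "LI") {
          current.elements.push(item.textContent.trim());
        }
      }
    }
  }

  return sections;
}
```

With jsdom this could be called as collectSections(dom.window.document.querySelector("article")), with the page title read separately from the h1. A section without a list simply ends up with an empty elements array.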