javascript typescript web-scraping x-ray

X-ray: how to parse a non-nested structure into an array of objects?

I'm trying to collect data with x-ray from a page that is structured like this:

<h1>Page title</h1>
<article>
  <h2 id="first">Title 1</h2>
  <h3>Subtitle 1</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
  <h2 id="second">Title 2</h2>
  <h3>Subtitle 2</h3>
  <h2 id="third">Title 3</h2>
  <h3>Subtitle 3</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
</article>

The article is split into sections by <h2> elements. Each section contains a subtitle and may contain a list of items. My goal is to get an object with this structure:

type Result = {
  pageTitle: string,
  sections: Array<{ subtitle?: string, elements?: string[] }>,
}

From the example structure above, I expect this output:

{
  pageTitle: "Page title",
  sections: [
    {
      subtitle: "Subtitle 1",
      elements: ["Element 1", "Element 2", "Element 3"]
    },
    {
      subtitle: "Subtitle 2",
      elements: [] // or any falsy value
    },
    {
      subtitle: "Subtitle 3",
      elements: ["Element 1", "Element 2", "Element 3"]
    }
  ]
}

I've tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a user-defined filter
  sections: xray("article", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

But this doesn't work as expected: there is only one article tag on the page, and the [] tells xray to iterate over whatever the selector (article in my case) returns, so I get a single section.

I've also tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a user-defined filter
  sections: xray("h2", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

This returns 0 results, probably because xray("h2", /* other code */) scopes the selection to each h2 and nothing else, and my h2 elements don't contain any nested elements.

So is there a way to get an array of objects from a non-nested HTML structure?

Solution

The x-ray library doesn't provide an easy way to group sibling elements within its selector syntax. It primarily works with parent-child relationships, and the structure you're trying to scrape doesn't follow that pattern.

Ideally, a library like Puppeteer would be more suitable, but x-ray combined with jsdom can handle it too.

The solution is to pre-process the HTML to encapsulate each section within a separate container, then scrape that new structure with x-ray.

Steps:

1. Load the HTML into jsdom.
2. Iterate over each <h2> element and collect it and its following siblings up to the next <h2>.
3. Wrap each group in a new <div>.
4. Pass the resulting HTML to x-ray.

Code:

    const { JSDOM } = require("jsdom");
    const Xray = require("x-ray");

    // Register the trim filter that the "h1 | trim" selector relies on.
    const xray = Xray({
      filters: {
        trim: (value) => (typeof value === "string" ? value.trim() : value),
      },
    });

    const html = "<html>...</html>"; // placeholder: your page's HTML as a string

    const dom = new JSDOM(html);
    const document = dom.window.document;

    // Wrap every <h2> and the siblings that follow it (up to the next <h2>)
    // in its own <div class="section">.
    document.querySelectorAll("h2").forEach((h2) => {
      const wrapper = document.createElement("div");
      wrapper.className = "section";
      h2.parentNode.insertBefore(wrapper, h2);

      let node = h2;
      while (node && (node === h2 || node.tagName !== "H2")) {
        // Read the next sibling before moving the node: once the node is
        // inside the wrapper, nextElementSibling no longer points at its
        // old neighbour.
        const next = node.nextElementSibling;
        wrapper.appendChild(node);
        node = next;
      }
    });

    const processedHtml = document.body.innerHTML;

    xray(processedHtml, {
      pageTitle: "h1 | trim",
      sections: xray("div.section", [{
        subtitle: "h3",
        elements: ["ul li"]
      }])
    })((err, result) => {
      if (err) throw err;
      console.log(result);
    });

This approach relies on mutating the parsed DOM before scraping, which may not be ideal.
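
If you would rather not rewrite the DOM at all, one option is to walk the children of the article once and build the sections directly. The following is only a minimal sketch under some assumptions: `collectSections` is a hypothetical helper name (not part of x-ray or jsdom), it assumes the h2, h3 and ul elements are direct children of the container as in the question, and it only touches the standard `children`, `tagName` and `textContent` properties.

```javascript
// Hypothetical helper: group the flat children of a container into
// sections, starting a new section at every <h2>.
// Works on any DOM-like element exposing children, tagName and textContent.
function collectSections(container) {
  const sections = [];
  let current = null;

  for (const el of Array.from(container.children)) {
    if (el.tagName === "H2") {
      // Each <h2> opens a new section.
      current = { subtitle: undefined, elements: [] };
      sections.push(current);
    } else if (!current) {
      // Content before the first <h2> belongs to no section.
      continue;
    } else if (el.tagName === "H3") {
      current.subtitle = el.textContent.trim();
    } else if (el.tagName === "UL") {
      // Collect the text of every direct <li> child.
      for (const li of Array.from(el.children)) {
        if (li.tagName === "LI") {
          current.elements.push(li.textContent.trim());
        }
      }
    }
  }

  return sections;
}
```

With jsdom this could be called as collectSections(new JSDOM(html).window.document.querySelector("article")), with pageTitle read separately from the h1 element's textContent. The trade-off is that x-ray is no longer involved in extracting the sections, so this only makes sense if you don't need its crawling or pagination features for that part.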