javascript typescript web-scraping x-ray

X-ray: how to parse a non-nested structure into an array of objects?


I'm trying to collect data with x-ray from a page that is structured like this:

<h1>Page title</h1>
<article>
  <h2 id="first">Title 1</h2>
  <h3>Subtitle 1</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
  <h2 id="second">Title 2</h2>
  <h3>Subtitle 2</h3>
  <h2 id="third">Title 3</h2>
  <h3>Subtitle 3</h3>
  <ul>
    <li>Element 1
    <li>Element 2
    <li>Element 3
  </ul>
</article>

The article is split into sections by <h2> elements. Each section contains a subtitle and may contain a list of items. My goal is to get an object with this structure:

type Result = { 
  pageTitle: string,
  sections: { subtitle?: string, elements?: string[] }[],
}

From the example above, I expect this output:

{
  pageTitle: "Page title",
  sections: [
    {
      subtitle: "Subtitle 1",
      elements: ["Element 1", "Element 2", "Element 3"]
    },
    {
      subtitle: "Subtitle 2",
      elements: [] // or any falsy value
    },
    {
      subtitle: "Subtitle 3",
      elements: ["Element 1", "Element 2", "Element 3"]
    }
  ]
}

I've tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a user-defined filter
  sections: xray("article", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

But I've figured out that this doesn't work as expected: there is only one article tag on the page, and [] tells xray to iterate over whatever the selector (article in my case) returns.

I've also tried:

xray(url, {
  pageTitle: "h1 | trim", // where trim is a user-defined filter
  sections: xray("h2", [{
    subtitle: "h3",
    elements: ['h3 ~ ul li']
  }])
})

This returns 0 results, probably because xray("h2", /* other code */) "scopes" the selection to h2 and nothing else, and my h2 elements don't contain any nested elements.

So is there a way to get an array of objects from a non-nested HTML structure?


Solution

  • The x-ray library doesn't provide an easy way to capture sibling elements within its syntax. It primarily works with parent-child relationships, and the structure you're trying to scrape doesn't conform to that pattern.

    A library like Puppeteer would arguably be more suitable, but x-ray combined with jsdom can handle it too.

    The solution is to pre-process the HTML so that each section is wrapped in its own container, then scrape that new structure with x-ray.

    Steps:

    1. Load the HTML into a jsdom document
    2. Iterate over each <h2> element, collecting it and its following siblings until the next <h2>
    3. Wrap each group in a new <div>
    4. Pass the resulting HTML to x-ray

    Code:

        const { JSDOM } = require("jsdom");
        const Xray = require("x-ray");
        
        // Register the trim filter that the selectors below refer to.
        const xray = Xray({
          filters: {
            trim: (value) => (typeof value === "string" ? value.trim() : value),
          },
        });
        
        const html = "<!-- Your HTML here -->";
        
        const dom = new JSDOM(html);
        const document = dom.window.document;
        
        // Wrap every <h2> and the siblings that follow it in its own <div class="section">.
        document.querySelectorAll("h2").forEach((h2) => {
          const section = document.createElement("div");
          section.className = "section";
          h2.parentNode.insertBefore(section, h2);
        
          let node = h2;
          do {
            // Read the next sibling before moving the node: once the node is
            // inside the wrapper, its nextElementSibling is null.
            const next = node.nextElementSibling;
            section.appendChild(node);
            node = next;
          } while (node && node.tagName !== "H2");
        });
        
        const processedHtml = document.body.innerHTML;
        
        xray(processedHtml, {
          pageTitle: "h1 | trim",
          // Select only the wrappers created above, not every div on the page.
          sections: xray("div.section", [{
            subtitle: "h3",
            elements: ["ul li"]
          }])
        })((err, result) => {
          if (err) throw err;
          console.log(result);
        });
    

    Note that this approach relies heavily on DOM manipulation, which may not be ideal.
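
    If mutating the DOM is a concern, the grouping itself can be done on plain data instead. The following is only a minimal sketch, not part of x-ray or jsdom: it assumes the direct children of <article> have already been read into a flat array of records, and the record shape ({ tag, text, items }), the groupSections helper and the sampleChildren data are all hypothetical names introduced for illustration.

    ```javascript
    // Hypothetical record shape, one entry per direct child of <article>:
    //   { tag: "H2" | "H3" | "UL", text: string, items: string[] }

    // Start a new section at every H2 and attach the records that follow it.
    function groupSections(children) {
      const sections = [];
      let current = null;

      for (const child of children) {
        if (child.tag === "H2") {
          current = { subtitle: undefined, elements: [] };
          sections.push(current);
        } else if (current === null) {
          continue; // ignore anything that appears before the first H2
        } else if (child.tag === "H3") {
          current.subtitle = child.text;
        } else if (child.tag === "UL") {
          current.elements.push(...child.items);
        }
      }

      return sections;
    }

    // Sample data mirroring the structure from the question (assumed, hand-written).
    const sampleChildren = [
      { tag: "H2", text: "Title 1", items: [] },
      { tag: "H3", text: "Subtitle 1", items: [] },
      { tag: "UL", text: "", items: ["Element 1", "Element 2", "Element 3"] },
      { tag: "H2", text: "Title 2", items: [] },
      { tag: "H3", text: "Subtitle 2", items: [] },
      { tag: "H2", text: "Title 3", items: [] },
      { tag: "H3", text: "Subtitle 3", items: [] },
      { tag: "UL", text: "", items: ["Element 1", "Element 2", "Element 3"] },
    ];

    console.log(JSON.stringify(groupSections(sampleChildren), null, 2));
    ```

    With jsdom, records of this shape could be built from article.children by reading each element's tagName, its trimmed textContent, and the trimmed textContent of its li descendants. The page title would then be read separately, for example from the h1 element.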