javascript regex dom full-text-search textnode

How can one enable/implement a full text search of any of a DOM node's or the entire document's text-content?

I've stumbled upon a technical test in which I have to count occurrences of words inside an HTML body. I have to ignore script tags and comments; these are the only conditions stated in the problem. And I have to pass the tests of a Chai file that looks like the one below.

Here is the plunker's test in which this problem occurred: https://plnkr.co/edit/tpl:lHY8ZJu6RAs6Sxee?p=preview&preview

The HTML file:

<body>
  <div id="text">
    Hello <strong>world</strong>
    <p class="world">This is the p 1</p>
    <p class="rabbit">This is the p 2</p>
    <p>This is the p 3</p>
    <p>Hello world !</p>
    <!-- <p>Not displayed hello world</p> -->
    <p>It's 9 o'clock, I will send you an e-mail.</p>
    <p>Is this the 'main'?</p>
    <p>This is a multiline paragraph</p>
    <pre>
      <div>This is some math in HTML</div>
      const n = 2;
      if (1 < n && n > 4) console.log(n);
    </pre>
    <ul>
      <li>This is right</li>
      <li>This is a copyright</li>
    </ul>
    <script>
      console.log('Hello world');
    </script>
  </div>

  <!-- Include the function file -->
  <script src="script.js"></script>
  <!-- Include the test file -->
  <script src="/test/test.js"></script>

  <div id="mocha"></div>
  <script>
    mocha.setup('bdd');
  </script>
  <script>
    mocha.run();
  </script>
</body>

Chai.js testing file:

describe('countOccurence', function() {
  it('count hello world', function() {
    assert.equal(countOccurence('Hello world'), 1);
    assert.equal(countOccurence('hello world'), 1);
    assert.equal(countOccurence('Hello World'), 1);
    assert.equal(countOccurence('Hello'), 2);
    assert.equal(countOccurence('world'), 2);
  });

  it('count p', function() {
    assert.equal(countOccurence('p'), 3);
  });

  it('count right', function() {
    assert.equal(countOccurence('right'), 1);
  });

  it('count n', function() {
    assert.equal(countOccurence('n'), 4);
  });

  it('count log', function() {
    assert.equal(countOccurence('log'), 1);
    assert.equal(countOccurence('console.log'), 1);
  });

  it('count clock', function() {
    assert.equal(countOccurence("o'clock"), 1);
    assert.equal(countOccurence('clock'), 0);
  });

  it('count e-mail', function() {
    assert.equal(countOccurence('e-mail'), 1);
  });

  it('count sentence', function() {
    assert.equal(countOccurence('This is a multiline paragraph'), 1);
    assert.equal(countOccurence("Hello world ! It's 9 o'clock"), 0);
  });
});

To solve this problem, I've created a function that extracts the content of the HTML file and gets rid of the scripts and comments.

function extractHTMLContent() {
  const bodyContent = document.body.innerHTML;

  const noScriptContent = bodyContent.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');

  const noCommentContent = noScriptContent.replace(/<!--[\s\S]*?-->/g, '');

  const textContent = noCommentContent.replace(/<\/?[^>]+(>|$)/g, ' ');

  return textContent;
}

And then I use this function:

function countOccurence(phrase) {

  if (typeof phrase !== 'string') {
    throw new TypeError('Phrase needs to be a string/');
  }

  const bodyContent = extractHTMLContent();

  const normalizedText = bodyContent.toLowerCase();

  const escapedPhrase = phrase.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  const regex = new RegExp(`\\b${escapedPhrase}\\b`, 'gi');

  console.log("Regex: ", regex);


  const matches = normalizedText.match(regex);
  console.log(matches);
  return matches ? matches.length : 0;
}

This function checks that the argument passed to countOccurence is a string, converts the extracted text to lower case, escapes the special regex characters in the phrase, builds a regular expression that is meant to match only the passed string, and then collects all matches within the extracted HTML content.

I can't manage to pass the "count clock" and "count sentence" tests. I suspect the apostrophes are part of the problem in the sentence test, and that my regex does not work as I expect: it still counts the string clock even though it is not supposed to.

If any of you have suggestions, I'd be really glad to hear them.

Solution

The approach is mainly two-fold:

1) Extract all text content into a unified searchable string while preserving the correct text flow (order).
2) Create a regex from the provided search/query string, where the query a) needs to be unified according to the same rules as the previously extracted text content and b) needs to have all regex-specific characters escaped.

As for 1), in order to preserve the natural flow of the text within an element node, one has to implement a recursive approach that collects all non-empty text nodes in document order. A mapping step then collapses, for each text node, every whitespace sequence of its text value into a single space and trims the result. Finally, the array of unified strings gets joined into a single searchable string.
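The unification step can be tried in isolation, without a DOM. The sketch below assumes the text-node values have already been collected in document order; rawValues and unifyWhitespace are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in for the values of text nodes collected in document order.
const rawValues = [
  '\n    Hello ',
  'world',
  '\n    ', // whitespace-only value, dropped below
  'This is a\n      multiline paragraph',
];

// Collapse every whitespace sequence into one space, then trim both ends.
const unifyWhitespace = value => value.replace(/\s+/g, ' ').trim();

const searchableText = rawValues
  .map(unifyWhitespace)
  .filter(value => value !== '')
  .join(' ');

console.log(searchableText);
// Hello world This is a multiline paragraph
```

Joining with a single space means that text from neighboring elements ends up adjacent in the searchable string, so a query may match across element boundaries.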

As for 2), the OP's count function needs to be refactored into one which expects the searchable string as its first parameter and either a regex or a string-based search/query as its second. The remaining sub-tasks are the ones already listed as 2a) and 2b).
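To see why the escaping in 2b) matters, consider the query console.log from the OP's tests. This is a minimal sketch; escapeRegExp and toSearchRegExp are hypothetical helper names, and the sample string is made up for the demonstration.

```javascript
// Escape every character that has a special meaning inside a regex pattern.
const escapeRegExp = value => value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// 2a) unify whitespace like the extracted text, 2b) escape, then add word boundaries.
const toSearchRegExp = query =>
  new RegExp(`\\b${ escapeRegExp(query.replace(/\s+/g, ' ').trim()) }\\b`, 'g');

const sample = 'console.log(n); consoleXlog(n);';

// Unescaped, the dot matches any character, so both call-like strings are counted.
const unescapedCount = (sample.match(/\bconsole.log\b/g) ?? []).length;
// Escaped, only the literal text "console.log" is counted.
const escapedCount = (sample.match(toSearchRegExp('console.log')) ?? []).length;

console.log({ unescapedCount, escapedCount });
// { unescapedCount: 2, escapedCount: 1 }
```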

// element- and text-node specific detection-helpers.

function isNonScriptElementNode(node) {
  return (
    node.nodeType === 1 &&
    node.tagName.toLowerCase() !== 'script'
  );
}
function isNonEmptyTextNode(node) {
  return (
       (node.nodeType === 3)
    && (node.parentNode.tagName.toLowerCase() !== 'script')
    && (node.nodeValue.trim() !== '')
  );
}

// recursive text-node specific reducer-functionality.
function collectNonEmptyTextNodeList(node) {
  const result = [];

  if (isNonScriptElementNode(node)) {

    result.push(
      ...[...node.childNodes].reduce((list, childNode) =>

        list.concat(collectNonEmptyTextNodeList(childNode)), []
      )
    );
  } else if (isNonEmptyTextNode(node)) {

    result.push(node)
  }
  return result;
}

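// the OP's newly implemented occurrence-count function.
function countOccurrence(text, stringOrRegExp) {
  // unify whitespace like the extracted text, then escape regex syntax characters.
  const escapeSearch = value =>
    value.replace(/\s+/g, ' ').trim().replace(/[-[\]{}()*+?.,\\^$|#]/g, '\\$&');

  // use a passed regex as is (it should carry the `g` flag in order to count
  // every match); otherwise build a word-bounded, global regex from the string.
  const regXSearch = stringOrRegExp?.test
    && stringOrRegExp
    || RegExp(`\\b${ escapeSearch(String(stringOrRegExp)) }\\b`, 'g');

  return (text.match(regXSearch) ?? []).length;
}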


const textNodeList = collectNonEmptyTextNodeList(document.body);

const textContent = textNodeList
  .map(node => node.textContent.replace(/\s+/g, ' ').trim())
  .join(' ');


console.log({ textContent });

console.log(
  "'hello world' count ...", countOccurrence(textContent, 'hello world'), // 0
);
console.log(
  "'Hello world' count ...", countOccurrence(textContent, 'Hello world'), // 1
);
console.log(
  "'Hello World' count ...", countOccurrence(textContent, 'Hello World'), // 1
);
console.log(
  "/hello world/ig count ...", countOccurrence(textContent, /hello world/ig), // 2
);
console.log('\n');

console.log(
  "'Hello' count ...", countOccurrence(textContent, 'Hello'), // 2
);
console.log(
  "'world' count ...", countOccurrence(textContent, 'world'), // 1
);
console.log(
  "'World' count ...", countOccurrence(textContent, 'World'), // 1
);
console.log(
  "\/world\/ig count ...", countOccurrence(textContent, /world/ig), // 2
);
console.log('\n');

console.log(
  '"o\'clock" count ...', countOccurrence(textContent, "o'clock"), // 1
);
console.log(
  "'This is a multiline paragraph' count ...", countOccurrence(textContent, 'This is a multiline paragraph'), // 1
);
const search = `Hello World !

It's 9 o'clock`;

console.log(
  `"${ search }" count ...`, countOccurrence(textContent, search), // 1
);

<div id="text">
  Hello <strong>world</strong>
  <p class="world">This is the p 1</p>
  <p class="rabbit">This is the p 2</p>
  <p>This is the p 3</p>
  <p>Hello World !</p>
  <!-- <p>Not displayed hello world</p> -->
  <p>It's 9 o'clock, I will send you an e-mail.</p>
  <p>Is this the 'main'?</p>
  <p>This is a multiline paragraph</p>
  <pre>
    <div>This is some math in HTML</div>
    const n = 2;
    if (1 &lt; n && n &gt; 4) console.log(n);
  </pre>
  <ul>
    <li>This is right</li>
    <li>This is a copyright</li>
  </ul>
  <script>
    console.log('Hello world');
  </script>
</div>
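
One case from the OP's tests that a plain \b boundary does not cover is "count clock": the apostrophe in o'clock is a non-word character, so \bclock\b still matches inside it. A possible way around this, sketched under the assumption that apostrophes and hyphens should count as part of a word, is to replace \b with lookarounds. countWholeWords is a hypothetical name, the sample text is shortened from the markup above, and lookbehind assertions need a reasonably recent JavaScript engine.

```javascript
// Hypothetical variant of the counter: a match must not be directly
// preceded or followed by a word character, an apostrophe or a hyphen.
function countWholeWords(text, query) {
  const escaped = query
    .replace(/\s+/g, ' ')
    .trim()
    .replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  const regX = new RegExp(`(?<![\\w'-])${ escaped }(?![\\w'-])`, 'g');

  return (text.match(regX) ?? []).length;
}

const sampleText = "It's 9 o'clock, I will send you an e-mail. This is a copyright";

console.log(countWholeWords(sampleText, 'clock'));   // 0
console.log(countWholeWords(sampleText, "o'clock")); // 1
console.log(countWholeWords(sampleText, 'e-mail'));  // 1
console.log(countWholeWords(sampleText, 'mail'));    // 0
console.log(countWholeWords(sampleText, 'right'));   // 0
```

The trade-off is that a word wrapped in single quotes, such as 'main' in the markup above, would no longer be counted for the query main, so whether this rule fits depends on how the test suite defines a word.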