
How to get text from all descendants of an element, disregarding scripts?


My current project involves gathering text content from an element and all of its descendants, based on a provided selector.

For example, when supplied the selector #content and run against this HTML:

<div id="content">
  <p>This is some text.</p>
  <script type="text/javascript">
    var test = true;
  </script>
  <p>This is some more text.</p>
</div>

my script would return (after a little whitespace cleanup):

This is some text. var test = true; This is some more text.

However, I need to disregard text nodes that occur within <script> elements.

This is an excerpt of my current code (technically, it matches based on one or more provided selectors):

// get text content of all matching elements
var content = '';
var x, y, matches, match;
for (x = 0; x < selectors.length; x++) { // 'selectors' is an array of CSS selectors from which to gather text content
  matches = Sizzle(selectors[x], document);
  for (y = 0; y < matches.length; y++) {
    match = matches[y];
    if (match.innerText) { // IE
      content += match.innerText + ' ';
    } else if (match.textContent) { // other browsers
      content += match.textContent + ' ';
    }
  }
}

It's a bit simplistic in that it just returns all text nodes within the element (and its descendants) that matches the provided selector. The solution I'm looking for would return all text nodes except for those that fall within <script> elements. It doesn't need to be especially high-performance, but I do need it to ultimately be cross-browser compatible.

I'm assuming that I'll need to somehow loop through all children of the element that matches the selector and accumulate all text nodes other than ones within <script> elements; it doesn't look like there's any way to identify JavaScript once it's already rolled into the string accumulated from all of the text nodes.

I can't use jQuery (for performance/bandwidth reasons), although you may have noticed that I do use its Sizzle selector engine, so jQuery's selector logic is available.


Solution

    function getTextContentExceptScript(element) {
        var text= [];
        for (var i= 0, n= element.childNodes.length; i<n; i++) {
            var child= element.childNodes[i];
            if (child.nodeType===1 && child.tagName.toLowerCase()!=='script')
                text.push(getTextContentExceptScript(child));
            else if (child.nodeType===3)
                text.push(child.data);
        }
        return text.join('');
    }
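
As a rough sketch of how the recursive function above might slot into the selector loop from the question: `collectText` and its `select` parameter are hypothetical names introduced here for illustration. `select` stands in for a selector engine such as Sizzle, assumed to be callable as `select(selector, root)` and to return an array-like list of elements. The whitespace cleanup at the end is one possible approach, not necessarily the one the asker uses.

```javascript
// Recursive collector from the answer above, repeated so this sketch runs on its own.
function getTextContentExceptScript(element) {
    var text = [];
    for (var i = 0, n = element.childNodes.length; i < n; i++) {
        var child = element.childNodes[i];
        if (child.nodeType === 1 && child.tagName.toLowerCase() !== 'script')
            text.push(getTextContentExceptScript(child));
        else if (child.nodeType === 3)
            text.push(child.data);
    }
    return text.join('');
}

// Hypothetical helper: 'select' is assumed to behave like Sizzle(selector, root)
// and return an array-like list of matching elements.
function collectText(selectors, select, root) {
    var parts = [];
    for (var x = 0; x < selectors.length; x++) {
        var matches = select(selectors[x], root);
        for (var y = 0; y < matches.length; y++) {
            parts.push(getTextContentExceptScript(matches[y]));
        }
    }
    // Collapse runs of whitespace to single spaces, then trim both ends.
    return parts.join(' ').replace(/\s+/g, ' ').replace(/^\s+|\s+$/g, '');
}
```

In a page that already loads Sizzle, the call would presumably look like `collectText(selectors, Sizzle, document)`. Passing the selector engine in as an argument keeps the helper independent of any particular library.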
    

    Or, if you are allowed to change the DOM by removing the <script> elements (which usually has no noticeable side effects), this is quicker:

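    // Removes every <script> under element from the live DOM, then reads its text.
    // (The function name is illustrative.)
    function getTextContentRemovingScripts(element) {
        var scripts= element.getElementsByTagName('script');
        while (scripts.length!==0)
            scripts[0].parentNode.removeChild(scripts[0]);
        return 'textContent' in element? element.textContent : element.innerText;
    }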
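
If modifying the live document is not acceptable, one possible variation (not part of the original answer) is to apply the same removal to a deep copy made with `cloneNode(true)`. The sketch below assumes the element is small enough that copying it is cheap; `getTextWithoutScripts` is a hypothetical name. How `innerText` behaves on a detached clone in older IE versions is an assumption worth verifying in the target browsers.

```javascript
// Hypothetical helper: strips <script> elements from a deep copy of the element,
// so the live document is left untouched.
function getTextWithoutScripts(element) {
    var clone = element.cloneNode(true);
    var scripts = clone.getElementsByTagName('script');
    // Walk backwards so removals never shift an index that is still needed,
    // whether the collection is live or a static snapshot.
    for (var i = scripts.length - 1; i >= 0; i--) {
        scripts[i].parentNode.removeChild(scripts[i]);
    }
    return 'textContent' in clone ? clone.textContent : clone.innerText;
}
```

The trade-off is an extra copy of the subtree on every call, which may matter for large elements, in exchange for leaving the page's own scripts in place.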