javascript web-scraping shadow-dom custom-element native-web-component

How can I get all the HTML in a document or node containing shadowRoot elements?


I have not seen a satisfactory answer to this question. This is basically a duplicate of this question, but that one was improperly closed and the answers given are not sufficient.

I have come up with my own solution which I will post below.

This can be useful for web scraping or, in my case, for testing a JavaScript library that handles custom elements. Once I have verified that the library produces the output I want, I use this function to scrape the HTML for a given test and save that copied HTML as the expected output to compare against in future test runs.
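The comparison step of that workflow could look roughly like the sketch below. The helper names `normalizeHTML` and `matchesSnapshot` are hypothetical, and the whitespace normalization is an assumption about what counts as "the same" output; in a real browser test the `actual` string would come from the extraction function rather than a literal.

```javascript
// Hypothetical helper: collapse whitespace so formatting differences
// between the saved snapshot and the fresh output do not cause failures.
function normalizeHTML(html) {
    return html
        .replace(/>\s+</g, '><')   // drop whitespace between adjacent tags
        .replace(/\s+/g, ' ')      // collapse remaining runs of whitespace
        .trim()
}

// Hypothetical helper: compare freshly extracted HTML with a saved snapshot.
function matchesSnapshot(actualHTML, expectedHTML) {
    return normalizeHTML(actualHTML) === normalizeHTML(expectedHTML)
}

// A saved snapshot, pretty-printed by hand for readability.
const expected = `
    <my-card>
        <h2>Title</h2>
        <p>Body text</p>
    </my-card>`

// Stand-in for the scraped output of the current test run.
const actual = '<my-card><h2>Title</h2><p>Body text</p></my-card>'

console.log(matchesSnapshot(actual, expected))
```

Dropping whitespace between tags is only safe when that whitespace is not significant to the test; a stricter comparison may be preferable for inline content.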


Solution

  • Here is a function that does this. Note that it ignores HTML comments and any other nodes that are neither elements nor text. It retrieves regular elements, text nodes, and custom elements with shadow roots, and it handles slotted template content. It can only see open shadow roots, since shadowRoot is null for closed ones. It has not been tested exhaustively but works well for my needs.

    Use it like extractHTML(document.body) or extractHTML(document.getElementById('app')).

    function extractHTML(node) {
                
        // return a blank string if not a valid node
        if (!node) return ''
    
        // if it is a text node just return the trimmed textContent
        if (node.nodeType===3) return node.textContent.trim()
    
        //beyond here, only deal with element nodes
        if (node.nodeType!==1) return ''
    
        let html = ''
    
        // clone the node for its outer html sans inner html
        let outer = node.cloneNode()
    
        // if the node has a shadowroot, jump into it
        node = node.shadowRoot || node
        
        if (node.children.length) {
            
            // we checked for children but now iterate over childNodes
            // which includes #text nodes (and even other things)
            for (let n of node.childNodes) {
                
                // if the node is a slot
                if (n.assignedNodes) {
                    
                    const assigned = n.assignedNodes()
    
                    // an assigned slot: a slot can have several assigned
                    // nodes, so recurse into each of them in order
                    if (assigned.length) {
                        for (let a of assigned) html += extractHTML(a)
    
                    // an unassigned slot: use its fallback content
                    } else { html += n.innerHTML }                    
    
                // node is not a slot, recurse
                } else { html += extractHTML(n) }
            }
    
        // node has no children
        } else { html = node.innerHTML }
    
        // insert all the (children's) innerHTML 
        // into the (cloned) parent element
        // and return the whole package
        outer.innerHTML = html
        return outer.outerHTML
        
    }
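A slot can have more than one assigned node, for example when several light-DOM children land in the same default slot, so slot handling has to consider the whole list returned by `assignedNodes()`. Below is a minimal sketch of that choice in isolation. The helper name `slotContentNodes` is hypothetical, and the demo uses plain stand-in objects rather than real DOM nodes so that it runs outside a browser.

```javascript
// Hypothetical helper: the nodes a <slot> contributes to the rendered output.
// Relies only on the standard `assignedNodes()` method and `childNodes` list.
function slotContentNodes(slot) {
    const assigned = slot.assignedNodes()
    // Assigned nodes replace the fallback content entirely; otherwise the
    // slot's own children (its fallback content) are what gets rendered.
    return assigned.length > 0 ? Array.from(assigned) : Array.from(slot.childNodes)
}

// Stand-in objects (not real DOM nodes) that mimic the two slot states.
const filledSlot = {
    assignedNodes: () => ['<li>one</li>', '<li>two</li>'],
    childNodes: ['fallback'],
}
const emptySlot = {
    assignedNodes: () => [],
    childNodes: ['fallback'],
}

console.log(slotContentNodes(filledSlot).length) // 2
console.log(slotContentNodes(emptySlot).length)  // 1
```

In the traversal loop, each node returned by such a helper would be passed to extractHTML in turn, which keeps every slotted child in the output.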