
How can I match whitespace outside of HTML comments with RegEx?


I would like to replace instances of "\n", "\t", and "    " (four spaces) using a regular expression, but preserve all whitespace inside HTML comment blocks. A comment can contain anything, including other HTML tags, so I have to match "<!--" and "-->" specifically. There may also be multiple comments, with whitespace to match in between them. I can use multiple regular expressions if needed, but I cannot modify the HTML content aside from the replacement.

Here is some sample code to experiment with:

<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>
<div>
    <p>Sample text!</p>
    <!--
        <img src="test.jpg" alt="This is an image!" width="500" height="600">
    -->
</div>

In this instance, all sets of four spaces should be matched except for the ones in each comment (lines 4, 5, 10, 11, 16, 17).

I have already split up my expressions into one for each type of whitespace, and I have been experimenting with spaces. The closest I have gotten is this:

/(?<!<!--.*?(?<!-->.*?))    (?!(?!.*?<!--).*?-->)/gs

which matches four-space runs outside the first and last comment blocks, but it also matches the ones inside the middle comment block, which is incorrect. However, I suspect it could be fixed by modifying something in the second half:

/    (?!(?!.*?<!--).*?-->)/gs

Any suggestions? Is this even possible?

UPDATE: In this situation I am not trying to match opening tags; rather, I want the whitespace outside of a specific element (and in this case, the comment block does not have the same syntax as other elements anyway). My ultimate goal was to use it in a heavily customized instance of TinyMCE, where I wanted to prevent whitespace from being clobbered by using the protect option. That option takes a list of regular expressions and does its own replacement, swapping each match for its own <!--mce:protected %0A--> type comment.

After posting this, I realized that I could just protect the entire comment separately, because it would not show up in the editor regardless...
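
For illustration, protecting whole comments might look roughly like the following. This is only a sketch: it assumes the protect option behaves as described above, the '#editor' selector is a placeholder, and the tinymce.init call is left commented out so the pattern can be checked on its own.

```javascript
// Pattern for an entire HTML comment; [\s\S] also matches newlines,
// so comments spanning several lines are covered without the s flag.
const commentPattern = /<!--[\s\S]*?-->/g;

const editorConfig = {
  selector: '#editor',        // hypothetical element id, for illustration only
  protect: [commentPattern],  // list of regular expressions to protect
};

// In a page that has loaded TinyMCE, the config would be passed along:
// tinymce.init(editorConfig);

// Check what the pattern matches, independent of TinyMCE:
const html = '<p>a</p>\n<!--\n    <img src="test.jpg">\n-->\n<p>b</p><!-- c -->';
console.log(html.match(commentPattern));
```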


Solution

  • Instead of lookarounds, you could match the comments first with a capture group and keep them as is. The other branch of the alternation then matches each run of four spaces, which gets removed.

    Such as

    /(<!--.*?-->)|    /gs
    

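    and replace it with $1. When the comment branch matches, $1 is the comment itself, so it is put back unchanged. When the four-space branch matches, group 1 did not participate, so $1 is the empty string and the spaces are removed.

    See the test case: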

    const text = `<div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>
    <div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>
    <div>
        <p>Sample text!</p>
        <!--
            <img src="test.jpg" alt="This is an image!" width="500" height="600">
        -->
    </div>`;
    
    
    // The s flag lets . match newlines, so multi-line comments are captured whole.
    const output = text.replace(/(<!--.*?-->)|    /gs, '$1');
    
    console.log(output);
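
The same idea can be extended to cover "\n" and "\t" in a single pass by adding them as further branches of the alternation. The sketch below is one possible way to do that, not part of the original answer. stripOutsideComments is a hypothetical helper name, and a replacer function is used in place of '$1' to make the "keep the comment, drop everything else" logic explicit.

```javascript
// Remove "\n", "\t", and four-space runs that lie outside HTML comments.
function stripOutsideComments(html) {
  // Branch 1 captures a whole comment; the other branches match one
  // whitespace token each. The scan runs left to right, so a comment is
  // consumed in full before any whitespace inside it can be matched.
  const pattern = /(<!--[\s\S]*?-->)| {4}|\t|\n/g;
  // A comment match is returned unchanged; any other match is dropped.
  return html.replace(pattern, (match, comment) => comment || '');
}

const sample = '<div>\n    <p>Hi</p>\n\t<!--\n    keep\t me\n-->\n</div>';
console.log(JSON.stringify(stripOutsideComments(sample)));
// prints "<div><p>Hi</p><!--\n    keep\t me\n--></div>"
```

Note that an unterminated comment (a "<!--" with no closing "-->") is not captured by the first branch, so whitespace after it would still be stripped.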