html regex anonymize

Anonymize HTML with regex


I'm trying to anonymize an HTML string with regex, for an SQL query.

https://regex101.com/r/QWt1E1/1

Regex:

(?<!\<)[^<>\s](?!\>)

Test string:

<p><em>Hi [User</em></p>
<p><em>Tack f&ouml;r visat intresse.</em></p>
<p><em>Good luck!</em><em>&nbsp;</em></p>
<p><em>Sincerely</em></p>

Result after replacing every match with n:

<p><em>nn nnnnn</nm></p>
<p><em>nnnn nnnnnnnn nnnnn nnnnnnnnn</nm></p>
<p><em>nnnn nnnnn</nm><em>nnnnnn</nm></p>
<p><em>nnnnnnnnn</nm></p>

The plan is to replace every non-whitespace character that is not inside a tag (between < and >) with an n. It almost works, but in my example it also replaces the e in </em>. I'm not sure why, or how to fix it.

How can I adjust the regex to not replace the e in the example?
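
The behaviour can also be reproduced outside regex101. This is a minimal sketch in Python, assuming its re module treats these lookarounds the same way as the flavour selected on regex101. The names ORIGINAL and sample are just for illustration.

```python
import re

# The pattern from the question. It only checks the characters directly
# before and after the candidate character.
ORIGINAL = re.compile(r"(?<!<)[^<>\s](?!>)")

sample = "<p><em>Sincerely</em></p>"

# The e in </em> is preceded by "/" (not "<") and followed by "m" (not ">"),
# so it passes both lookarounds and is replaced as well.
print(ORIGINAL.sub("n", sample))  # <p><em>nnnnnnnnn</nm></p>
```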


Solution

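  • Use a negative lookahead for [^<>]*> instead of just >. This ensures that the current position is not followed by a > before any other angle bracket, which would mean you are currently inside a tag. Your original pattern only inspects the characters directly next to the candidate, so the e in </em> (preceded by / and followed by m) passes both checks.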

    This also means that you can drop the lookbehind:

    [^<>\s](?![^<>]*>)
              ^^^^^^
    

    https://regex101.com/r/QWt1E1/3

    Still, it would be better to parse the HTML using an HTML parser, if at all possible.
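
To compare the two approaches, here is a rough sketch in Python (chosen only for illustration, since the question targets an SQL query). It applies the pattern above and, alongside it, a parser-based variant built on the standard library's html.parser. The names FIXED, anonymize_regex, Anonymizer and anonymize_parsed are made up for this example. The parser variant is deliberately simple: it decodes entities such as &ouml; before replacing, lower-cases closing tag names, and drops comments and doctypes, so its output can differ from the regex output on such input.

```python
import re
from html.parser import HTMLParser

# Suggested pattern: skip a character if a ">" follows it before any other
# angle bracket, because that means the character sits inside a tag.
FIXED = re.compile(r"[^<>\s](?![^<>]*>)")


def anonymize_regex(html: str) -> str:
    """Replace every non-whitespace character outside tags with 'n'."""
    return FIXED.sub("n", html)


class Anonymizer(HTMLParser):
    """Hypothetical helper: re-emits tags verbatim and masks text nodes."""

    def __init__(self):
        super().__init__()  # entities are decoded before handle_data is called
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Mask every non-whitespace character of the text node.
        self.parts.append(re.sub(r"\S", "n", data))


def anonymize_parsed(html: str) -> str:
    parser = Anonymizer()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(anonymize_regex("<p><em>Good luck!</em></p>"))   # <p><em>nnnn nnnnn</em></p>
print(anonymize_parsed("<p><em>Good luck!</em></p>"))  # <p><em>nnnn nnnnn</em></p>
```

The regex version assumes that < and > never appear outside of tag delimiters, for example inside quoted attribute values. That is the kind of input where a parser is the safer choice.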