I'm trying to anonymize an HTML string with regex, for an SQL query.
https://regex101.com/r/QWt1E1/1
(?<!\<)[^<>\s](?!\>)
<p><em>Hi [User</em></p>
<p><em>Tack f&ouml;r visat intresse.</em></p>
<p><em>Good luck!</em><em>&nbsp;</em></p>
<p><em>Sincerely</em></p>
<p><em>nn nnnnn</nm></p>
<p><em>nnnn nnnnnnnn nnnnn nnnnnnnnn</nm></p>
<p><em>nnnn nnnnn</nm><em>nnnnnn</nm></p>
<p><em>nnnnnnnnn</nm></p>
The plan was to replace every character that is not within <> with an n.
It almost works, but in my example it also replaces the e in </em>. I'm not sure why, or how to fix it.
How can I adjust the regex so that it does not replace the e in the example?
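The lookarounds in the original pattern only inspect the characters immediately adjacent to the current one, so the e in </em>, which neither directly follows < nor directly precedes >, still matches.
Use a negative lookahead for [^<>]*> instead of just >, to ensure that the current position is not followed by a > before any other angle bracket (because that would indicate you're currently inside a tag).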
This also means that you can drop the lookbehind:
[^<>\s](?![^<>]*>)
          ^^^^^^
https://regex101.com/r/QWt1E1/3
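To see the difference between the two patterns outside regex101, here is a minimal sketch in Python. It assumes Python's `re` module, not the regex engine of whichever SQL database is in use (lookaround support varies between engines), and `anonymize` is a hypothetical helper name introduced only for this illustration.

```python
import re

# Pattern from the question: only inspects the characters immediately
# before and after the current one.
ORIGINAL = re.compile(r"(?<!<)[^<>\s](?!>)")

# Pattern from the answer: rejects any character that is followed by a '>'
# with no other angle bracket in between, i.e. any character inside a tag.
FIXED = re.compile(r"[^<>\s](?![^<>]*>)")


def anonymize(html, pattern=FIXED):
    """Replace every matched character with 'n' (hypothetical helper)."""
    return pattern.sub("n", html)


sample = "<p><em>Hi [User</em></p>"
print(anonymize(sample, ORIGINAL))  # <p><em>nn nnnnn</nm></p>
print(anonymize(sample))            # <p><em>nn nnnnn</em></p>
```

Note that the fixed pattern still relies on angle brackets appearing only as tag delimiters. A literal > in the text (for example "a > b") would make the preceding characters look as if they were inside a tag, so they would be left untouched.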
Still, it would be better to parse the HTML using an HTML parser, if at all possible.
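If the anonymization can be done in application code rather than inside the SQL query, a parser-based version might look roughly like the sketch below. It uses `html.parser.HTMLParser` from Python's standard library; the class name `TextAnonymizer` and the function `anonymize_html` are hypothetical, and treating each entity as a single masked character is an assumption, not something the question specifies.

```python
import re
from html.parser import HTMLParser


class TextAnonymizer(HTMLParser):
    """Hypothetical helper: re-emits tags and masks the text between them."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &nbsp; as separate events.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written in the source.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are also copied verbatim.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser lowercases tag names, so </EM> would come out as </em>.
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Mask every non-whitespace character of the text node.
        self.parts.append(re.sub(r"\S", "n", data))

    def handle_entityref(self, name):
        # Assumption: a named entity counts as one anonymized character.
        self.parts.append("n")

    def handle_charref(self, name):
        # Assumption: a numeric character reference counts as one character too.
        self.parts.append("n")


def anonymize_html(html):
    parser = TextAnonymizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(anonymize_html("<p><em>Hi [User</em></p>"))  # <p><em>nn nnnnn</em></p>
```

This sketch drops comments and doctype declarations, since it does not override those handlers. Unlike the regex, it does not depend on the text being free of stray > characters.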