regex xml notepad++ negate

Regex to remove all except XML


I need help with a regex for Notepad++ that matches everything except the XML.

The regex I'm using: (!?\<.*\>) <-- I want the opposite of this (it works on the first three lines)

Example input:

[20173003] This text is what I want to delete [<Person><Name>Foo</Name><Surname>Bar</Surname></Person>], and this text too.
[20173003] This is another text to delete [<Person><Name>Bar</Name><Surname>Foo</Surname></Person>]
[20173003] This text too... [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], delete me!
[20173003] But things like this make the regex to fail < [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>

Expected result:

<Person><Name>Foo</Name><Surname>Bar</Surname></Person>
<Person><Name>Bar</Name><Surname>Foo</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>

Thanks in advance!


Solution

  • This is not perfect, but it should work with your input, which looks quite simple and well structured.

    If you only need to handle a single, unnested <Person> tag, you can use the simple regex (<Person>.*?</Person>)|. which either matches a whole <Person> element and captures it into Group 1, or matches any other single character. Replace with the conditional replacement pattern (?{1}$1\n:) which reinserts the captured Person element followed by a newline when Group 1 matched, and otherwise replaces the match with an empty string.

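For readers who want to try the same "capture what you keep, match everything else" idea outside Notepad++, here is a minimal Python sketch. Python's re module has no (?{1}...) conditional replacement, so a replacement function stands in for it; the names log, PATTERN and keep_person are made up for this illustration and are not part of the original answer.

```python
import re

# Two of the sample lines from the question, including the one with stray "<" and ">".
log = (
    "[20173003] This text is what I want to delete "
    "[<Person><Name>Foo</Name><Surname>Bar</Surname></Person>], and this text too.\n"
    "[20173003] But things like this make the regex to fail < "
    "[<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>\n"
)

# Group 1 captures a <Person> element; the bare "." consumes any other character.
# re.DOTALL plays the role of Notepad++'s ". matches newline" option.
PATTERN = re.compile(r"(<Person>.*?</Person>)|.", re.DOTALL)

def keep_person(match):
    """Emulate (?{1}$1\\n:): keep Group 1 plus a newline, drop anything else."""
    if match.group(1) is not None:
        return match.group(1) + "\n"
    return ""

result = PATTERN.sub(keep_person, log)
print(result, end="")
```

Because the first alternative is tried before the single-character fallback at every position, the stray "<" and ">" on the last line are simply consumed as ordinary characters.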

    To make it a bit more generic, you may match any opening XML tag together with its corresponding closing tag using a recursion-based Boost regex, combined with the same conditional replacement pattern:

    Find What:      (<(\w+)[^>]*>(?:(?!</?\2\b).|(?1))*</\2>)|.
    Replace With: (?{1}$1\n:)
    . matches newline: ON

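Python's built-in re module does not support the (?1) recursion used above, so the following is only a rough sketch of the same balanced-tag idea, using an explicit depth counter instead of recursion. It is a swapped-in technique, not the Boost regex itself; extract_elements and OPEN_TAG are hypothetical names, and the sketch assumes tags never contain ">" inside attribute values.

```python
import re

# Any opening tag: "<", a tag name, optional attributes, ">".
OPEN_TAG = re.compile(r"<(\w+)[^>]*>")

def extract_elements(text):
    """Return every top-level balanced <tag>...</tag> span found in text."""
    results = []
    pos = 0
    while True:
        start = OPEN_TAG.search(text, pos)
        if start is None:
            break
        name = re.escape(start.group(1))
        # Opening or closing tags with the same name as the one just found.
        same_tag = re.compile(r"<(/?)" + name + r"\b[^>]*>")
        depth = 1
        end = None
        for m in same_tag.finditer(text, start.end()):
            depth += -1 if m.group(1) else 1
            if depth == 0:
                end = m.end()
                break
        if end is None:
            # No matching close: treat this "<" as plain text and keep scanning.
            pos = start.start() + 1
        else:
            results.append(text[start.start():end])
            pos = end
    return results

sample = (
    "[20173003] This text too... "
    "[<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], delete me!\n"
    "[20173003] But things like this make the regex to fail < "
    "[<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>\n"
)
print("\n".join(extract_elements(sample)))
```

Joining the returned spans with newlines gives the same one-element-per-line layout that the replacement pattern produces in Notepad++.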

    Regex Details:

    Replacement pattern: