Tags: regex, ruby, html-parsing, ruby-1.9.3

Regex to match anything except HTML tags when code is encoded using < and >


I am trying to use regex to match any text except for HTML tags. I have found this solution for "normal" HTML code:

<[^>]*>(*SKIP)(*F)|[^<]+

However, my code is encoded using &lt; and &gt; instead of < and >, and I have not been able to modify the regex above for it to work.

As an example, given the text:

Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt;

I need to match "Hi" and "there, how are you". Note that I also need to match text that is not between tags, such as "Hi" in this example.

UPDATE: since I am using Ruby's gsub, it looks like I cannot use (*SKIP) and (*F) at all.

UPDATE 2: I was trying not to go into too much detail, but it seems to be important: I actually need to replace all the spaces in a text, except those that are part of a tag, be it a &lt; ... &gt; tag or a <...> tag.


Solution

  • You can use

    text = text.gsub(/(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m) { $1 || '_' }
    

    I suggest [[:blank:]] instead of \s since I assume you do not want to replace line breaks. See the Ruby demo.

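    The regex above matches

      • (&lt;.*?&gt;|<[^>]*>) - Group 1: either &lt;, then any zero or more characters, as few as possible (including line breaks, because of the m flag), then &gt;; or <, then zero or more characters other than >, then >
      • | - or
      • [[:blank:]] - a single horizontal whitespace character, such as a space or a tab.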

    The { $1 || '_' } block in the replacement means that when Group 1 matches, the Group 1 value is returned as is; otherwise, _ is returned in place of the matched horizontal whitespace character.
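To make the behaviour concrete, here is a minimal sketch that applies the same pattern to the sample text from the question. The names TAG_OR_BLANK and underscore_spaces are hypothetical, chosen only for this illustration, and the trailing literal <b> tag is an assumption added to the sample so that both tag styles are exercised.

```ruby
# Hypothetical names for illustration; only the regex and the
# { $1 || '_' } block come from the answer above.
TAG_OR_BLANK = /(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m

# Replace horizontal whitespace with "_" everywhere except inside tags,
# whether the tags are entity-encoded (&lt;...&gt;) or literal (<...>).
def underscore_spaces(text)
  # When Group 1 matched, a whole tag was consumed: put it back unchanged.
  # Otherwise the match was a single blank character: replace it.
  text.gsub(TAG_OR_BLANK) { $1 || '_' }
end

# Sample from the question, plus an assumed literal tag at the end.
sample = "Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt; <b class=\"x\">bye now</b>"

p underscore_spaces(sample)
# => "Hi_&lt;p class=\"hello\"&gt;\r\nthere,_how_are_you\r\n&lt;/p&gt;_<b class=\"x\">bye_now</b>"
```

The spaces inside class="hello" and class="x" survive because the tag alternative is tried first and consumes the whole tag, and the \r\n line breaks are untouched because [[:blank:]] does not match them.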