I am trying to use regex to match any text except for HTML tags. I have found this solution for "normal" HTML code:
<[^>]*>(*SKIP)(*F)|[^<]+
However, my code is encoded using &lt; and &gt; instead of < and >, and I have not been able to modify the regex above for it to work.
As an example, given the text:
Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt;
I need to match "Hi" and "there, how are you". Note that I also need to match text that is not between tags, such as "Hi" in this example.
UPDATE: since I am using Ruby's gsub, it looks like I cannot even use (*SKIP) and (*F).
UPDATE 2: I was trying not to go into much detail, but it seems to be important:
I actually need to replace all the spaces in a text, except those spaces that are part of a tag, be it a &lt; ... &gt; tag or a <...> tag.
You can use
text = text.gsub(/(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m) { $1 || '_' }
I suggest [[:blank:]] instead of \s since I assume you do not want to replace line breaks.
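To illustrate the difference, here is a minimal sketch comparing the two character classes on a made-up string (the variable names and the sample string are only for illustration):

```ruby
# Sample with one space, one tab, and a Windows line break (\r\n).
sample = "a b\tc\r\nd"

# \s matches spaces and tabs, but also \r and \n.
with_s = sample.gsub(/\s/, '_')

# [[:blank:]] matches horizontal whitespace only, so the line break survives.
with_blank = sample.gsub(/[[:blank:]]/, '_')

p with_s      # => "a_b_c__d"
p with_blank  # => "a_b_c\r\nd"
```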
The regex above matches:
- (&lt;.*?&gt;|<[^>]*>) - Group 1: either &lt;, then any zero or more chars as few as possible, then &gt;; or <, then zero or more chars other than >, and then a >
- | - or
- [[:blank:]] - any single horizontal whitespace (you may also use [\p{Zs}\t] to match any Unicode horizontal whitespace).
The { $1 || '_' } block in the replacement means that when Group 1 matches, the Group 1 value is returned as is; otherwise, _ is returned as the replacement for the horizontal whitespace.
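As a quick check, the approach can be run against the sample text from the question. This is only a sketch: it assumes the entity-encoded tags are written as &lt; ... &gt;, and the method name underscore_spaces and the second sample string are made up for the illustration.

```ruby
# Hypothetical helper wrapping the gsub call from the answer.
# Tags (entity-encoded or plain) are captured in Group 1 and returned as is;
# any other horizontal whitespace is replaced with an underscore.
def underscore_spaces(text)
  text.gsub(/(&lt;.*?&gt;|<[^>]*>)|[[:blank:]]/m) { $1 || '_' }
end

# Sample from the question, with entity-encoded tags.
encoded = "Hi &lt;p class=\"hello\"&gt;\r\nthere, how are you\r\n&lt;/p&gt;"
encoded_result = underscore_spaces(encoded)

# Made-up sample with plain tags.
plain = "one two <b id=\"x\">three four</b>"
plain_result = underscore_spaces(plain)

p encoded_result
# => "Hi_&lt;p class=\"hello\"&gt;\r\nthere,_how_are_you\r\n&lt;/p&gt;"
p plain_result
# => "one_two_<b id=\"x\">three_four</b>"
```

Note that the space inside each tag (before class or id) is kept, and the line breaks are untouched because [[:blank:]] does not match them.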