I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some of them; it does not match "&nbsp;" and similar non-ASCII whitespace characters. I'm looking for a regular expression which matches all (common) white-space characters that can occur in a Java String.
[Edit]
To clarify: I do not mean the string sequence "&nbsp;". I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar white-space meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), "WORD JOINER" (U+2060, encoded in Unicode 3.2 and above), "ZERO WIDTH NO-BREAK SPACE" (U+FEFF), and any other character that can be regarded as white-space.
[Edit]
\p{javaWhitespace} does not match U+00A0 (&nbsp;) and is not the answer I'm looking for.
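A minimal check that reproduces this, assuming a standard JDK and the default regex flags. The class name WhitespaceCheck and the helper matches are made up for illustration only:

```java
import java.util.regex.Pattern;

class WhitespaceCheck {
    // Hypothetical helper: true if the regex matches the single code point cp in full.
    static boolean matches(String regex, int cp) {
        return Pattern.matches(regex, new String(Character.toChars(cp)));
    }

    public static void main(String[] args) {
        // Code points named in the question, plus an ordinary space for comparison.
        int[] codePoints = {0x0020, 0x00A0, 0x202F, 0x2060, 0xFEFF};
        String[] classes = {"\\s", "\\p{javaWhitespace}"};
        for (int cp : codePoints) {
            StringBuilder row = new StringBuilder(String.format("U+%04X", cp));
            for (String c : classes) {
                row.append("  ").append(c).append('=').append(matches(c, cp));
            }
            System.out.println(row);
        }
    }
}
```

With default flags, both classes match U+0020 but neither matches U+00A0, U+202F, U+2060 or U+FEFF.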
The &nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.
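One caveat: if the extracted text still contains the actual U+00A0 character, Java's default \s will not match it. The following is only a sketch under the assumption that "whitespace" means the default \s, any Unicode separator (\p{Z}), and the two format characters U+2060 and U+FEFF named in the question. The class name UnicodeWhitespace and the method normalize are hypothetical:

```java
import java.util.regex.Pattern;

class UnicodeWhitespace {
    // Assumed definition of "whitespace": default \s, Unicode separators (\p{Z}),
    // plus WORD JOINER (U+2060) and ZERO WIDTH NO-BREAK SPACE (U+FEFF).
    static final Pattern WS = Pattern.compile("[\\s\\p{Z}\\u2060\\uFEFF]+");

    // Collapses each run of such characters into one ASCII space and trims the ends.
    static String normalize(String text) {
        return WS.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "a b c d e f"
        System.out.println(normalize("a\u00A0b\u202Fc\u2060d\uFEFFe \t f"));
    }
}
```

Whether U+2060 and U+FEFF belong in the class is a judgment call, since Unicode classifies them as format characters rather than spaces; drop them from the pattern if they should be left alone.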