I'm parsing some HTML4 with Groovy's XmlSlurper backed by a tagsoup Parser.
I'm getting the text() of a node successfully, but HTML &nbsp; spaces are giving me some difficulty when trying to test for equality with another value. Specifically, .trim() does not actually trim the string of all whitespace. It appears to me that the characters on either side of the value are whitespace (see code below), but String.trim() isn't trimming the way I'd expect. As can be seen from the code sample, Character.isSpaceChar() reports the first character in the string as a space character.
Why is String.trim() not trimming this value that I've obtained from XmlSlurper?
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
def html = '''
<html>
<body>
<span id="interested">&nbsp;hello&nbsp;</span>
</body>
</html>
'''
def slurper = new XmlSlurper(new Parser() )
def document = slurper.parseText(html)
def value = document.'**'.find { it['@id'] == 'interested' }.text()
println "value=[${value}]"
println "first char isWhitespace? ${Character.isWhitespace(value.charAt(0))}"
println "first char isSpaceChar? ${Character.isSpaceChar(value.charAt(0))}"
assert 'hello' == value.trim()
Yields:
value=[ hello ]
first char isWhitespace? false
first char isSpaceChar? true
Exception thrown
Assertion failed:
assert 'hello' == value.trim()
               |  |     |
               |  |      hello
               |   hello
               false
I'm using Groovy Version: 2.3.6 JVM: 1.8.0 Vendor: Oracle Corporation OS: Mac OS X
Here is a corrected example:
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
def html = '''
<html>
<body>
<span id="interested">&nbsp;hello&nbsp;</span>
</body>
</html>
'''
def slurper = new XmlSlurper(new Parser() )
def document = slurper.parseText(html)
def value = document.'**'.find { it['@id'] == 'interested' }.text()
println "value=[${value}]"
println "first char isWhitespace? ${Character.isWhitespace(value.charAt(0))}"
println "first char isSpaceChar? ${Character.isSpaceChar(value.charAt(0))}"
value = value.trim()
println "first char isWhitespace? ${Character.isWhitespace(value.charAt(0))}"
println "first char isSpaceChar? ${Character.isSpaceChar(value.charAt(0))}"
assert 'hello' == value.replaceAll(String.valueOf((char) 160), " ").trim()
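The explanation is the difference between a space and a non-breaking space. The value is padded with non-breaking spaces (U+00A0, char 160), not ordinary spaces. String.trim() only strips leading and trailing characters with a code point of U+0020 or lower, so it leaves U+00A0 in place. That also accounts for the output above: Character.isWhitespace() explicitly excludes non-breaking spaces, while Character.isSpaceChar() includes them. Replacing char 160 with a regular space before calling trim() makes the assertion pass.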
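If the input may contain other Unicode spaces besides U+00A0, a regex-based strip is one alternative to replacing char 160 by hand. The following is a minimal sketch, not part of the original answer: the `stripAll` closure is a hypothetical helper name, and the value is built directly from `\u00A0` rather than parsed with tagsoup so that the snippet has no dependencies.

```groovy
// U+00A0 is the non-breaking space that &nbsp; decodes to
def nbsp = '\u00A0'
def value = nbsp + 'hello' + nbsp

// String.trim() only strips characters <= U+0020, so the NBSPs survive
assert value.trim() == value
assert !Character.isWhitespace(value.charAt(0))
assert Character.isSpaceChar(value.charAt(0))

// Hypothetical helper: strip leading/trailing ASCII whitespace (\s)
// and any Unicode separator (\p{Z}), which includes U+00A0
def stripAll = { String s ->
    s.replaceAll('^[\\s\\p{Z}]+|[\\s\\p{Z}]+$', '')
}

assert stripAll(value) == 'hello'
// Interior spaces are left alone; only the edges are stripped
assert stripAll('  \t' + nbsp + 'hello world' + nbsp + '\n') == 'hello world'
```

Unlike the replaceAll(...).trim() approach, this leaves any non-breaking spaces inside the string untouched, which may or may not be what you want when comparing values.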