
Groovy XmlSlurper with TagSoup and non-breaking space values


I'm parsing some HTML4 with Groovy's XmlSlurper backed by a TagSoup Parser.

I'm getting the text() of a node successfully, but HTML &nbsp; (non-breaking space) entities are giving me trouble when I test the value for equality with another string. Specifically, .trim() does not strip all of the surrounding whitespace. The characters on either side of the value look like whitespace (see the code below), and Character.isSpaceChar() reports the first character as a space character, yet String.trim() leaves it in place.

Why is String.trim() not trimming this value that I've obtained from XmlSlurper?

@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser

def html = '''
<html>
<body>
<span id="interested">&nbsp;hello&nbsp;</span>
</body>
</html>
'''

def slurper = new XmlSlurper(new Parser() )
def document = slurper.parseText(html)

def value = document.'**'.find { it['@id'] == 'interested' }.text()

println "value=[${value}]"
println "first char isWhitespace? ${Character.isWhitespace(value.charAt(0))}"
println "first char isSpaceChar? ${Character.isSpaceChar(value.charAt(0))}"
assert 'hello' == value.trim()

Yields:

value=[ hello ]
first char isWhitespace? false
first char isSpaceChar? true
Exception thrown

Assertion failed: 

assert 'hello' == value.trim()
               |  |     |
               |  |      hello 
               |   hello 
               false

I'm using Groovy Version: 2.3.6 JVM: 1.8.0 Vendor: Oracle Corporation OS: Mac OS X


Solution

  • Here is a corrected example:

    @Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
    import org.ccil.cowan.tagsoup.Parser
    
    def html = '''
    <html>
    <body>
    <span id="interested">&nbsp;hello&nbsp;</span>
    </body>
    </html>
    '''
    
    def slurper = new XmlSlurper(new Parser() )
    def document = slurper.parseText(html)
    
    def value = document.'**'.find { it['@id'] == 'interested' }.text()
    
    println "value=[${value}]"
    println "first char isWhitespace? ${Character.isWhitespace(value.charAt(0))}"
    println "first char isSpaceChar? ${Character.isSpaceChar(value.charAt(0))}"
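    // trim() only strips characters <= U+0020, so the non-breaking spaces survive
    value = value.trim()
    println "first char isWhitespace? ${Character.isWhitespace(value.charAt(0))}"
    println "first char isSpaceChar? ${Character.isSpaceChar(value.charAt(0))}"
    // replace each non-breaking space (U+00A0, decimal 160) with a regular space, then trim
    assert 'hello' == value.replaceAll(String.valueOf((char) 160), " ").trim()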
    

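    The explanation is the difference between a space and a non-breaking space. TagSoup turns &nbsp; into the character U+00A0 (decimal 160). String.trim() only removes leading and trailing characters whose code point is U+0020 or lower, so U+00A0 is left alone. Character.isWhitespace() explicitly excludes non-breaking spaces, while Character.isSpaceChar() returns true because U+00A0 is in the Unicode space-separator category. That is why the two checks disagree.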
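
If other Unicode space characters might show up as well, a regex-based trim is one possible alternative to replacing character 160 by hand. The following is only a minimal sketch: trimUnicode is a hypothetical helper name (it is not part of Groovy, XmlSlurper or TagSoup), and it assumes you want to strip anything matched by \s or by the Unicode separator category \p{Z} from both ends of the string.

```groovy
// Hypothetical helper, not a Groovy or TagSoup API.
// Strips leading and trailing characters that match \s (regex whitespace)
// or \p{Z} (Unicode separators, which include U+00A0, the non-breaking space).
String trimUnicode(String s) {
    s.replaceAll('^[\\s\\p{Z}]+|[\\s\\p{Z}]+$', '')
}

def value = '\u00A0hello\u00A0'

// String.trim() only removes characters <= U+0020, so the value is unchanged
assert value.trim() != 'hello'

// the regex-based helper removes the non-breaking spaces as well
assert trimUnicode(value) == 'hello'
```

Spaces inside the value are left untouched, because the pattern is anchored to the start and end of the string.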