html xquery basex tag-soup

Error parsing HTML with extended Unicode characters in BaseX


I am having trouble parsing HTML that contains extended Unicode characters with the BaseX HTML parser. Is it possible to make the parser support these characters?

Code:

let $htmlRaw := '<span class="eqn">&#120746; + &#120747; = &#120748;</span>'
let $htmlParsed := html:parse($htmlRaw, map { 'encoding': 'utf-8'})
return (
  'INPUT', 
  $htmlRaw,
  'OUTPUT',
  $htmlParsed
)

Output:

INPUT
<span class="eqn">𝞪 + 𝞫 = 𝞬</span>
OUTPUT
<html>
  <body>
    <span class="eqn">?? + ?? = ??</span>
  </body>
</html>

The bug seems to be related to the output-encoding parameter of the TagSoup library, which BaseX doesn't support.

For example:

$ echo '<span class="eqn">&#120746; + &#120747; = &#120748;</span>' | java -jar tagsoup-1.2.1.jar --html

<html><body><span class="eqn">&#55349;&#57258; + &#55349;&#57259; = &#55349;&#57260;</span>
</body></html>

$ echo '<span class="eqn">&#120746; + &#120747; = &#120748;</span>' | java -jar tagsoup-1.2.1.jar --html --output-encoding=utf-16
<html><body><span class="eqn">𝞪 + 𝞫 = 𝞬</span>
</body></html>
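
The pairs of references in the first output look like the two UTF-16 code units (a surrogate pair) that Java uses for each of these characters. A minimal sketch to check that arithmetic with the JDK alone; the class and method names are only illustrative and are not part of TagSoup or BaseX:

```java
public class SurrogateCheck {
    // Returns the UTF-16 code units Java uses to represent one Unicode code point.
    public static int[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] result = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            result[i] = units[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // 120746 is the first character of the test input (U+1D7AA).
        int[] units = utf16Units(120746);
        // Prints "55349 57258", the two numbers seen in the first TagSoup output.
        System.out.println(units[0] + " " + units[1]);
        // Recombining the pair gives back the original code point: prints 120746.
        System.out.println(Character.toCodePoint((char) units[0], (char) units[1]));
    }
}
```

So the default output is not random garbage: each character has been split into its high and low surrogate, and each half has been written as a separate numeric reference.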

Solution

  • If I add opt(writer, "encoding", Strings.UTF8); as line 156 of HtmlParser.java in BaseX (https://github.com/martin-honnen/basex/commit/4711a390e4069d363243f48c95456544916f40f7), the problem seems to go away. I am not sure, however, that this is the right way to fix it.

    The root of the problem seems to be two issues. First, TagSoup's XMLWriter, unless its output encoding is set to a Unicode encoding such as UTF-8 or UTF-16, outputs two numeric character references (one per UTF-16 surrogate) for a Unicode character outside the BMP.

    So you have to set UTF-8 or UTF-16 as the output encoding of TagSoup's XMLWriter. It then switches to Unicode mode and outputs the characters themselves instead of character references. With either encoding, the XMLWriter seems to feed the right characters to the StringWriter that BaseX sets up.

    Second, BaseX's internal String to byte[] conversion seems to expect UTF-8 encoded strings. I am not sure why that is the case on the Java platform, but the token function delegates its work to a utf8 function.

    So the fix in HtmlParser seems to be to set opt(writer, "encoding", Strings.UTF8).
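
    To illustrate the String to byte[] step, here is a small sketch using only the JDK's own UTF-8 encoder. It does not call BaseX's token or utf8 functions, whose internals are not shown here, and the class and method names are made up for the example. It shows that a Java String holding the real character (as an intact surrogate pair) converts to a single 4-byte UTF-8 sequence and round-trips cleanly:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Encodes a Java String as UTF-8 with the standard JDK encoder.
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Build the first test character (code point 120746) as a Java String.
        String alpha = new String(Character.toChars(120746));

        // The String holds two UTF-16 code units but only one code point.
        System.out.println(alpha.length());                           // 2
        System.out.println(alpha.codePointCount(0, alpha.length()));  // 1

        // A character outside the BMP takes four bytes in UTF-8.
        byte[] bytes = toUtf8(alpha);
        System.out.println(bytes.length);                             // 4

        // Decoding the bytes again yields the same String.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(alpha));                    // true
    }
}
```

    This only works when the writer hands over the actual characters. Once the two surrogates have been written as separate numeric references, the later conversion no longer sees them as one character.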