scala character-encoding chardet

juniversalchardet is defective on www.wikipedia.org


I'm trying to use juniversalchardet to auto-detect the encoding of a saved webpage. My first test uses www.wikipedia.org, which is UTF-8 encoded according to its HTTP response header (this information is lost once the page is saved to disk).

This is my Scala code:

    val content = <...load Wikipedia.html from disk...>
    val charsetD = new UniversalDetector(null)
    charsetD.handleData(content, 0, content.length)
    val charset = charsetD.getDetectedCharset

However, regardless of what I load, the detected charset is always 'null'. Is the juniversalchardet library defective, or am I using it wrong?


Solution

  • Problem solved: charsetD.handleData(content, 0, content.length) cannot handle a batch longer than 4096 bytes. Everything works once the function is called several times on smaller chunks of data.
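
A minimal sketch of the chunked approach described above, assuming `content` is an `Array[Byte]` and juniversalchardet is on the classpath. The `CharsetSniffer` object and `detectCharset` helper are hypothetical names, and the default chunk size of 4096 simply mirrors the limit reported above. The sketch also calls `dataEnd()` before reading the result, as the library's own sample usage does; the code in the question omits that call.

```scala
import org.mozilla.universalchardet.UniversalDetector

object CharsetSniffer {
  // Hypothetical helper: feeds `content` to the detector in chunks of at
  // most `chunkSize` bytes and returns the detected charset name, if any.
  def detectCharset(content: Array[Byte], chunkSize: Int = 4096): Option[String] = {
    val detector = new UniversalDetector(null)
    var offset = 0
    // Stop early once the detector reports that it has reached a decision.
    while (offset < content.length && !detector.isDone) {
      val len = math.min(chunkSize, content.length - offset)
      detector.handleData(content, offset, len)
      offset += len
    }
    // Signal end of input so the detector can finalize its guess.
    detector.dataEnd()
    // getDetectedCharset returns null when nothing was detected.
    val charset = Option(detector.getDetectedCharset)
    detector.reset()
    charset
  }
}
```

With the variables from the question, this would be called as `CharsetSniffer.detectCharset(content)`. Wrapping the result in `Option` turns the `null` seen in the question into `None`, which is easier to handle in Scala.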