python beautifulsoup jython-2.5 html5lib qf-test

Jython 2.5.1: UnicodeDecodeError


Recently I have been trying to parse data from an HTML file using Jython scripts in QF-Test 3.5.4 (note that the only supported Python version is 2.5.1, as per the release notes for version 3.5.1: http://www.qfs.de/en/qftest/relnotes.html#3.5.1).

Python libraries (old versions, because I needed support for Python 2.x): BeautifulSoup 3.2.1 and html5lib 0.95.

I am running Xubuntu 13.10.

The Jython script looks like this:

    # Script uses obsolete Python libraries because QF-Test only supports Python 2.5.1
    import urllib

    # BeautifulSoup 3.2.1 - Python 2.x support
    from BeautifulSoup import BeautifulSoup

    # html5lib 0.95 - has Python 2.5.1 support
    import html5lib
    from html5lib import sanitizer
    from html5lib import treebuilders

    # URL of HTML file that has been saved locally
    url = 'Tlacovky/$(website)'
    fp = urllib.urlopen(url)

    # create HTML5 parser that builds a BeautifulSoup tree and sanitizes tokens
    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
    html5lib_object = parser.parse(fp)
    html_string = str(html5lib_object)

    # load to BS
    soup = BeautifulSoup(html_string)

    # print every <script> element found in the document
    for content in soup.findAll('script'):
        print content

Now, when I execute the script with all the variables I need set correctly, I get this:

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 48-54: illegal Unicode character

    at org.python.core.PyException.fillInStackTrace(PyException.java:70)
    at java.lang.Throwable.<init>(Throwable.java:181)
    at java.lang.Exception.<init>(Exception.java:29)
    at java.lang.RuntimeException.<init>(RuntimeException.java:32)
    at org.python.core.PyException.<init>(PyException.java:46)
    at org.python.core.PyException.doRaise(PyException.java:200)
    at org.python.core.Py.makeException(Py.java:1171)
    at org.python.core.Py.makeException(Py.java:1175)
    at org.python.core.Py.makeException(Py.java:1179)
    at org.python.core.Py.makeException(Py.java:1183)
    at qfcommon$py.runscript$52(/opt/qftest/qftest-3.5.4/jython/Lib/qfcommon.py:962)
    at qfcommon$py.call_function(/opt/qftest/qftest-3.5.4/jython/Lib/qfcommon.py)
    at org.python.core.PyTableCode.call(PyTableCode.java:165)
    at org.python.core.PyBaseCode.call(PyBaseCode.java:182)
    at org.python.core.PyFunction.__call__(PyFunction.java:350)
    at qftest$py.runscript$3(/opt/qftest/qftest-3.5.4/jython/Lib/qftest.py:91)
    at qftest$py.call_function(/opt/qftest/qftest-3.5.4/jython/Lib/qftest.py)
    at org.python.core.PyTableCode.call(PyTableCode.java:165)
    at org.python.core.PyBaseCode.call(PyBaseCode.java:182)
    at org.python.core.PyFunction.__call__(PyFunction.java:350)
    at org.python.pycode._pyx386.f$0(<string>:1)
    at org.python.pycode._pyx386.call_function(<string>)
    at org.python.core.PyTableCode.call(PyTableCode.java:165)
    at org.python.core.PyCode.call(PyCode.java:18)
    at org.python.core.Py.runCode(Py.java:1209)
    at org.python.core.Py.exec(Py.java:1253)
    at org.python.util.PythonInterpreter.exec(PythonInterpreter.java:173)
    at de.qfs.apps.qftest.shared.script.JythonEngine.exec(SourceFile:195)
    at org.apache.bsf.BSFManager$6.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.bsf.BSFManager.exec(Unknown Source)
    at de.qfs.apps.qftest.run.RMIRunContext.runScript(SourceFile:1875)
    ... 16 more
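
The message comes from the `unicode_escape` codec, which turns `\uXXXX` and `\UXXXXXXXX` escapes into characters. As a rough illustration of this class of error, CPython raises the same kind of exception for an escape above U+10FFFF. This sketch does not establish which escapes Jython 2.5.1 itself rejects. The helper name `escape_error` is made up for this example.

```python
import sys

def escape_error(escaped_text):
    """Decode ASCII text containing backslash escapes; return the error or None."""
    # Round-trip through ASCII so the same code runs on Python 2 and 3.
    raw = escaped_text.encode("ascii")
    try:
        raw.decode("unicode_escape")
    except UnicodeDecodeError:
        # sys.exc_info() avoids the "except ... as e" syntax that 2.5 lacks.
        return sys.exc_info()[1]
    return None

# A valid escape decodes cleanly; one past U+10FFFF is rejected by CPython.
print(escape_error("\\u0041"))
print(escape_error("\\U00110000"))
```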

I managed to trace the problem to the import of "inputstream.py", which is where the error occurs.
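
One way to narrow a failure like this down to a single module is to import the package and its submodules one at a time and capture the traceback of the first one that fails. A minimal sketch, assuming the module path `html5lib.inputstream` from html5lib 0.95; the helper name `import_traceback` is hypothetical:

```python
import traceback

def import_traceback(module_name):
    """Try to import module_name; return the formatted traceback, or None on success."""
    try:
        __import__(module_name)
    except Exception:
        return traceback.format_exc()
    return None

# Walk from the package down to the suspect submodule.
for name in ("html5lib", "html5lib.inputstream"):
    result = import_traceback(name)
    print("%s: %s" % (name, result or "imported OK"))
```

The first name that reports a traceback instead of "imported OK" is the module to inspect.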

I am literally pulling my hair out with this one. If you can, please help me resolve this problem.

EDIT:

Fixed by modifying inputstream.py:

invalid_unicode_re = re.compile("[\u0001-\u0008\u000B\u000E-\u001F\u007F-\u009F\uD800-\uDFFF\uFDD0-\uFDEF\uFFFE\uFFFF\U0001FFFE\U0001FFFF\U0002FFFE\U0002FFFF\U0003FFFE\U0003FFFF\U0004FFFE\U0004FFFF\U0005FFFE\U0005FFFF\U0006FFFE\U0006FFFF\U0007FFFE\U0007FFFF\U0008FFFE\U0008FFFF\U0009FFFE\U0009FFFF\U000AFFFE\U000AFFFF\U000BFFFE\U000BFFFF\U000CFFFE\U000CFFFF\U000DFFFE\U000DFFFF\U000EFFFE\U000EFFFF\U000FFFFE\U000FFFFF\U0010FFFE\U0010FFFF]")

        #Craziness
        if len("\U0010FFFF") == 1:
            self.reportCharacterErrors = self.characterErrorsUCS4
            self.replaceCharactersRegexp = re.compile("[\uD800-\uDFFF]")
        else:
            self.reportCharacterErrors = self.characterErrorsUCS2
            self.replaceCharactersRegexp = re.compile("([\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])")
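
The `len(...) == 1` test is there to detect whether the runtime stores a character outside the Basic Multilingual Plane as a single unit. Note that in CPython 2 a plain (non-unicode) string literal does not interpret `\U` at all, so `len("\U0010FFFF")` is 10 and the `else` branch is always taken. A sketch of the same check without any escape in a literal, using `sys.maxunicode`; the name `is_wide_build` is hypothetical, and whether Jython 2.5.1 reports the expected value is an assumption to verify:

```python
import sys

def is_wide_build(maxunicode=None):
    """Return True if one unit can hold any code point up to U+10FFFF."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow (UCS-2 style) builds report 0xFFFF; wide builds report 0x10FFFF.
    return maxunicode > 0xFFFF

print(is_wide_build())
```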

Solution

  • [Largely rewritten mid-2016 to bring up to date.]

    html5lib doesn't support Jython 2.5, and never has. Some degree of support was introduced in html5lib 0.9999, but that requires Jython 2.7 (notably, support isn't guaranteed, but in principle it works).

    If you want to try to get it working with Jython 2.5, you need to do more than just replace invalid_unicode_re; see this bug. I'd suggest running the test suite with your modifications. Note also that nowadays we require Python 2.6 as a minimum, and supporting any variant of 2.5 would now take a large amount of work.
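
The version requirements above can be turned into an early check that fails with a clear answer instead of an obscure import error. This is only a sketch: the thresholds (Jython 2.7, otherwise Python 2.6) are taken from the answer as written in 2016 and may not match current html5lib releases, and the function name `html5lib_may_work` is hypothetical.

```python
import sys

def html5lib_may_work(version_info=None, platform=None):
    """Apply the minimum versions quoted in the answer to an interpreter."""
    if version_info is None:
        version_info = sys.version_info
    if platform is None:
        platform = sys.platform
    # Jython reports a platform string starting with "java".
    is_jython = platform.startswith("java")
    if is_jython:
        minimum = (2, 7)  # assumption: per the answer, Jython needs 2.7
    else:
        minimum = (2, 6)  # assumption: per the answer, 2.6 is the floor
    return tuple(version_info[:2]) >= minimum

# The interpreter bundled with QF-Test 3.5.4, as described in the question.
print(html5lib_may_work((2, 5, 1), "java1.7.0"))
```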