lxml and <wbr> tags


By default lxml doesn't understand the wbr tag, which is used to mark possible word-break points in long words. It serialises the tag as <wbr></wbr> when it should be written simply as <wbr>, like the br tag.

How do I add this behavior to lxml?


Solution

  • Actually, it is not difficult to patch libxml2, the C library that lxml uses to parse and serialise HTML (this walkthrough was done on Ubuntu 11.04 with Python 2.7.3).

    First define a test program wbr_test.py:

    from lxml import etree
    from cStringIO import StringIO
    
    wbr_html = """\
    <html>
      <head>
        <title>wbr test</title>
      </head>
    <body>
      Test for a breakable<wbr>word implementation change
    </body>
    </html>
    """
    
    parser = etree.HTMLParser()
    tree   = etree.parse(StringIO(wbr_html), parser)
    
    result = etree.tostring(tree.getroot(),
                             pretty_print=True, method="html")
    if result.split() != wbr_html.split(): # split, as we are not interested in whitespace differences
        print(result)
        print("not ok")
    else:
        print("OK")
    

    Make sure that it fails by running python wbr_test.py. The output should contain a </wbr> before </body>, and the script should print not ok at the end.
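
    The test script targets Python 2 (cStringIO, and etree.tostring returning a plain str). Below is a minimal sketch of the same whitespace-insensitive comparison that also copes with the bytes that etree.tostring returns under Python 3. The helper names normalise and has_stray_wbr_close are invented for this illustration and are not part of lxml, and the broken sample is only written in the style of the failure described above:

```python
def normalise(markup):
    """Return the markup as a list of whitespace-separated tokens."""
    if isinstance(markup, bytes):
        # etree.tostring() returns bytes on Python 3 unless encoding="unicode"
        markup = markup.decode("utf-8")
    return markup.split()


def has_stray_wbr_close(markup):
    """Return True if the serialised HTML contains an unwanted </wbr> end tag."""
    if isinstance(markup, bytes):
        markup = markup.decode("utf-8")
    return "</wbr>" in markup.lower()


expected = "<body>\n  breakable<wbr>word\n</body>\n"
# Hand-written stand-in for output from an unpatched libxml2
broken = b"<body>\n  breakable<wbr>word\n</wbr></body>\n"

print(normalise(expected) == normalise(broken))  # comparison used by the test
print(has_stray_wbr_close(broken))               # direct check for the symptom
```

    Checking for the stray end tag directly gives a clearer failure message than comparing the whole document.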

    Download, extract and compile libxml2:

    wget ftp://xmlsoft.org/libxml2/libxml2-2.8.0.tar.gz
    tar xvf libxml2-2.8.0.tar.gz 
    cd libxml2-2.8.0/
    ./configure --prefix=/usr
    make -j8  # adjust number to match your number of cores
    

    Install the library, then the Python libxml2 bindings:

    sudo make install
    cd python  # directory with the Python bindings in the libxml2 source tree
    sudo python setup.py install
    

    Run wbr_test.py once more to make sure it still fails with the freshly built, unpatched libxml2.
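
    lxml normally loads libxml2 as a shared library, so it is worth checking that it really picks up the newly installed build and not an older copy. A small sketch, assuming lxml was not built with a statically linked libxml2 (in that case replacing the system library has no effect on lxml). format_version is a helper invented for this example; LIBXML_VERSION and LIBXML_COMPILED_VERSION are attributes of lxml.etree:

```python
def format_version(version):
    """Turn a version tuple such as (2, 8, 0) into the string '2.8.0'."""
    return ".".join(str(part) for part in version)


try:
    from lxml import etree
except ImportError:
    etree = None  # lxml not installed; nothing to report

if etree is not None:
    # Version of the libxml2 library loaded at run time
    print("libxml2 in use:           " + format_version(etree.LIBXML_VERSION))
    # Version of libxml2 that lxml was compiled against
    print("libxml2 compiled against: " + format_version(etree.LIBXML_COMPILED_VERSION))
else:
    print("lxml is not installed")
```

    If the version in use is not the one just installed, the patched library will not be picked up by the test either.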

    Before editing, make a backup copy of HTMLparser.c, e.g. in /var/tmp.

    Now edit the file HTMLparser.c at the top level of the libxml2 source. Search for the word forced (it occurs only once); this takes you to the <br> tag definition. Copy the definition, that is, the line you just found and the line after it, plus an entry separator line '},'. The most appropriate insertion point is at the end of the table, after the definition of <var>. To get the commas right, insert the separator '},' followed by the two definition lines directly before the line containing only '}', not the one containing '};'. The '},' then closes the <var> entry and the existing '}' closes the new one.

    In the newly inserted lines, replace br with wbr and change DECL clear_attrs to NULL (assuming that a new tag has no deprecated attributes). The diff below also changes the description from forced to possible line break.

    The result should differ from the version in /var/tmp (diff -u /var/tmp/HTMLparser.c HTMLparser.c) as follows:

    @@ -1039,6 +1039,9 @@
     },
     { "var",   0, 0, 0, 0, 0, 0, 1, "instance of a variable or program argument",
    DECL html_inline, NULL, DECL html_attrs, NULL, NULL
    +},
    +{ "wbr",   0, 2, 2, 1, 0, 0, 1, "possible line break ",
    +   EMPTY , NULL , DECL core_attrs, NULL , NULL
     }
     };
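
    The manual edit can also be scripted. This is a rough sketch, assuming the br entry spans exactly two lines and the table ends with a lone } followed by };, as in the diff above. The function name add_wbr_entry and the cut-down sample table are made up for this illustration; the real file is much longer:

```python
def add_wbr_entry(source):
    """Return the C source with a wbr entry appended to the element table."""
    lines = source.splitlines()

    # Locate the br entry via its description, which contains the word "forced".
    br_index = next(i for i, line in enumerate(lines) if "forced" in line)
    first = lines[br_index].replace('"br"', '"wbr"').replace("forced", "possible")
    second = lines[br_index + 1].replace("DECL clear_attrs", "NULL")

    # After the br entry, find the table end: a line "}" followed by a line "};".
    end_index = next(i for i in range(br_index + 2, len(lines))
                     if lines[i].strip() == "};" and lines[i - 1].strip() == "}")
    close_index = end_index - 1

    # "}," closes the previously last entry; the existing "}" closes the new one.
    lines[close_index:close_index] = ["},", first, second]
    return "\n".join(lines) + "\n"


# Abbreviated stand-in for the element table in HTMLparser.c
sample = '''static const htmlElemDesc html40ElementTable[] = {
{ "br",    0, 2, 2, 1, 0, 0, 1, "forced line break ",
    EMPTY , NULL , DECL core_attrs, DECL clear_attrs , NULL
},
{ "var",   0, 0, 0, 0, 0, 0, 1, "instance of a variable or program argument",
    DECL html_inline , NULL , DECL html_attrs, NULL, NULL
}
};
'''

patched = add_wbr_entry(sample)
print(patched)
```

    To apply it for real, one would read HTMLparser.c, pass its text through add_wbr_entry and write the result back, keeping the backup in /var/tmp for the diff.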
    

    Make and install:

    make && sudo make install
    

    Run wbr_test.py once more. It should now print OK.