Python LXML catalog lookup


I'm writing an SCons file for building DocBook documentation. To trace dependencies, I need a way to resolve catalog lookups to absolute file paths.

Say I have a bit of DocBook XML:

<book xmlns="http://docbook.org/ns/docbook"
      xmlns:xi="http://www.w3.org/2001/XInclude">

  <info> 
    <title>Docbook example document</title>

    <xi:include href="file:///common/logo.xml"
        xpointer="logo"/>

  </info>
  <xi:include href="chap1/chap1.xml"/>
  <xi:include href="chap2/chap2.xml"/>
  <xi:include href="chap3/chap3.xml"/>
  <xi:include href="chap4/chap4.xml"/>

</book>

and a catalog.xml file:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <rewriteURI
    uriStartString="file:///stylesheet/"
    rewritePrefix="file:///home/kst/svn/TOOLS/Docbook/stylesheet/" />

  <rewriteURI
    uriStartString="file:///common/"
    rewritePrefix="file:///home/kst/svn/TOOLS/Docbook/common/" />


  <nextCatalog  catalog="/etc/xml/catalog" />

</catalog>

Getting the XInclude href string with lxml is no problem, but that is where I'm stuck. What I need is a way to get, from the catalog file, the absolute filename that file:///common/logo.xml resolves to (in this case /home/kst/svn/TOOLS/Docbook/common/logo.xml). It needs to be Python code so I can use it in my SConstruct file without too much hassle.

Any help is appreciated.


Solution

  • lxml uses the catalog support from libxml2. Set the environment variable XML_CATALOG_FILES to a space-separated list of catalogs (you can also set it from Python via os.environ). If the variable is not set, libxml2 checks for /etc/xml/catalog instead (which is of course no use on Windows).

    An alternative is to use a custom URI resolver. You can find more information in the lxml docs.

    EDIT: Apparently the question was not about the actual XInclude processing, which works, but about a way to "query" the catalog, that is, to ask it for the actual filenames that would be used for the inclusions.

    lxml (at least currently) has no API for that. The underlying libxml2 library does support it, however, and the "original" libxml2 Python bindings let you do it. Easy documentation is lacking, but the docstrings in the libxml2 source code help. So although this module is not nearly as nice to use as lxml, it seems to be your best bet. An example that seems to work:

    >>> import libxml2
    >>> libxml2.loadCatalog('catalog.xml')
    >>> print(libxml2.catalogResolveURI('file:///common/logo.xml'))
    file:///home/kst/svn/TOOLS/Docbook/common/logo.xml
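
If the libxml2 bindings are not available where SCons runs, a rough fallback is to read the rewriteURI entries from catalog.xml yourself. The sketch below is not libxml2's resolver and not an lxml API: it uses only the standard library, the function names (load_rewrite_rules, resolve_uri, file_uri_to_path) are made up for illustration, and it assumes the catalog contains only rewriteURI rules like the one in the question. It ignores nextCatalog, delegateURI, uri and every other catalog entry type.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.request import url2pathname

# Namespace of OASIS XML catalog files, as used in the question's catalog.xml.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def load_rewrite_rules(catalog_source):
    """Return (uriStartString, rewritePrefix) pairs from a catalog.

    catalog_source is a filename or a file object (hypothetical helper).
    Only rewriteURI entries are read; everything else is skipped.
    """
    root = ET.parse(catalog_source).getroot()
    rules = []
    for elem in root.iter("{%s}rewriteURI" % CATALOG_NS):
        start = elem.get("uriStartString")
        prefix = elem.get("rewritePrefix")
        if start and prefix:
            rules.append((start, prefix))
    return rules


def resolve_uri(uri, rules):
    """Rewrite uri with the longest matching rule, or return None."""
    matches = [(start, prefix) for start, prefix in rules
               if uri.startswith(start)]
    if not matches:
        return None
    # When several rules match, prefer the longest uriStartString.
    start, prefix = max(matches, key=lambda rule: len(rule[0]))
    return prefix + uri[len(start):]


def file_uri_to_path(uri):
    """Convert a file:// URI into a local filesystem path."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError("not a file URI: %r" % uri)
    return url2pathname(parsed.path)
```

In an SConstruct file this could be used as rules = load_rewrite_rules('catalog.xml'), followed by resolve_uri(href, rules) for each XInclude href and file_uri_to_path() on any result that is not None. The same file_uri_to_path() helper also works on the string returned by libxml2.catalogResolveURI, which is the more complete option when those bindings are installed.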