How to prevent lxml from converting the '&' character to '&amp;'?


I need to send the control characters &#x0D; and &#x0A; in my XML file so that the text is displayed correctly in the target system.

For the creation of the XML file I use the lxml library. This is my attempt:

from lxml import etree as et
import lxml.builder

e = lxml.builder.ElementMaker()

xml_doc = e.newOrderRequest(
    e.Orders(
        e.Order(
            e.OrderNumber('12345'),
            e.OrderID('001'),
            e.Articles(
                e.Article(
                    e.ArticleNumber('000111'),
                    e.ArticleName('Logitec Mouse'),
                    e.ArticleDescription('* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth')
                )
            )
        )
    )
)

tree = et.ElementTree(xml_doc)
tree.write('output.xml', pretty_print=True, xml_declaration=True, encoding="utf-8")

This is the result:

<?xml version='1.0' encoding='UTF-8'?>
<newOrderRequest>
  <Orders>
    <Order>
      <OrderNumber>12345</OrderNumber>
      <OrderID>001</OrderID>
      <Articles>
        <Article>
          <ArticleNumber>000111</ArticleNumber>
          <ArticleName>Logitec Mouse</ArticleName>
          <ArticleDescription>* 4 Buttons&amp;#x0D;&amp;#x0A;* 600 DPI&amp;#x0D;&amp;#x0A;* Bluetooth</ArticleDescription>
        </Article>
      </Articles>
    </Order>
  </Orders>
</newOrderRequest>

This is what I need:

<ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>

Is there a function in the lxml library to turn off the conversion or does anyone know a way to solve this problem? Thanks in advance.


Solution

  • The output of the Python script:

    import lxml.etree as et
    print(repr(et.fromstring('''<ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>''').text))
    

    ...is...

    '* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'
    

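    That means the Python string corresponding to the XML text * 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth is '* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'. In XML, &#x0D; and &#x0A; are character references for carriage return and line feed, not literal text, so in Python you pass the characters themselves.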

    Thus, the relevant line of code should be:

    e.ArticleDescription('* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth')
    

    ...and if the consumer doesn't treat the resulting output as exactly identical to <ArticleDescription>* 4 Buttons&#x0D;&#x0A;* 600 DPI&#x0D;&#x0A;* Bluetooth</ArticleDescription>, that consumer is broken.

    See https://replit.com/@CharlesDuffy2/ImportantClassicConversion#test.py for your code running with the modification suggested above.
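
To check the claim locally, here is a minimal sketch that serializes both variants and parses the good one back. The variable names are only illustrative. Note that lxml may spell the carriage return as a decimal reference such as &#13; and write the line feed as a literal newline. Any conforming parser should read those back as the same two characters.

```python
from lxml import etree as et

# The text the consumer should see after parsing: real CR and LF characters.
description = '* 4 Buttons\r\n* 600 DPI\r\n* Bluetooth'

element = et.Element('ArticleDescription')
element.text = description
serialized = et.tostring(element)

# For comparison: the literal text '&#x0D;' is plain data to lxml,
# so its '&' gets escaped to '&amp;' when the element is written out.
literal = et.Element('ArticleDescription')
literal.text = '* 4 Buttons&#x0D;&#x0A;* 600 DPI'
literal_serialized = et.tostring(literal)

print(serialized)
print(literal_serialized)

# Parsing the serialized form gives back the original characters.
round_tripped = et.fromstring(serialized).text
print(repr(round_tripped))
```

If the round-tripped text equals the original string, the document carries the control characters correctly, whichever spelling of the character reference appears in the file.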