python, lxml

How to cache elements to improve runtime performance with the lxml Python library


On the lxml.de website (https://lxml.de/performance.html) I see the following statement:

A way to improve the normal attribute access time is static instantiation of the Python objects, thus trading memory for speed. Just create a cache dictionary and run:

cache[root] = list(root.iter()) after parsing and:

del cache[root]

Can anyone provide a suitable Python code example showing how the mechanism above can be used in a Python function?


Solution

  • Assigning cache[root] = list(root.iter()) does keep the element objects cached in memory, as a simple test demonstrates.
    The mechanism is simple: the list holds a reference to the Python object of every element in the tree, so however an element is obtained afterwards, the lookup returns the same object that is already in the cache.

    Given an XML document, print the id of an element before and after setting the cache. Once the cache is set, every lookup of that element returns the id of the cached object.

    from lxml import objectify
    otree = objectify.parse('tmp2.xml')
    root = otree.getroot()
    print(id(root.Form_1.Country), root.Form_1.Country)
    
    cache = {}
    cache[root] = list(otree.iter())
    print(id(cache[root][3]), cache[root][3])
    print(id(root.Form_1.Country), root.Form_1.Country)
    
    # both point to the same object in memory
    print(root.Form_1.Country is cache[root][3])
    
    # the object can be obtained in different ways but point to the same object in the cache
    ele1 = root.xpath('(//Form_1/Country)[1]')[0]
    
    print(ele1 is cache[root][3])
    

    Result

    140257476833728 AFG
    
    140257476833280 AFG
    140257476833280 AFG
    
    True
    True
    

    As explained on the page linked in the question, this trades memory for speed:

    A way to improve the normal attribute access time is static instantiation of the Python objects, thus trading memory for speed

    Test XML (tmp2.xml)

    <Forms>
        <greeting>Hello, world!</greeting>
        <Form_1>
            <Country>AFG</Country>
            <Country>AFG</Country>
            <Country>IND</Country>
        </Form_1>
        <Form_1>
            <Country>IND</Country>
            <Country>USA</Country>
        </Form_1>
    </Forms>