python, caching, dependencies, dependency-management, lru

Pythonic approach to keeping track of cached variable/function dependencies


I have a system with a library that includes many slow functions/methods, for example SQL queries or computationally expensive algorithms. I have therefore identified those that can benefit from caching and use the lru_cache or cache decorators from functools. I additionally use cache_clear() to clear caches when a function's/method's dependent data/parameters have changed.

This is simple when the number of dependent data or parameters is small, for example when testing. However, when this approach is scaled to include data that depend on data from other cached functions/methods, keeping track of changes to these dependencies and clearing the appropriate cache(s) becomes tricky. See code example 1, which includes some scenarios where dependent data are changed and caches are cleared. In this case I must manually keep track of the dependencies and place cache_clear() calls on the appropriate functions/methods.

In an attempt to take a more systematic approach to tracking data dependencies and clearing caches, I have written a simple class to manage this. See code example 2 for the updated approach. I have split the problem into two halves: 1) declaring data dependencies, and 2) notifying the manager when a variable changes. The manager then clears the caches of any functions/methods that depend on the changed variable.

This approach relies on using string descriptions of the data. Ideally, so that I would not need to remember/use the exact same string descriptions everywhere, actual references to the data would be used instead, for example via id(). However, when data are updated the reference can change. See code example 3: the reference stored when declaring the dependency is different after the change to b.Y, so the manager can't find which caches need to be cleared.
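The instability of id() can be seen in a minimal standalone sketch (Holder is a hypothetical stand-in for the classes in the examples below):

```python
# Minimal sketch of why id() is unstable here (Holder is hypothetical,
# not part of the examples below).
class Holder:
    def __init__(self):
        self.Y = (1, 2, 3)

h = Holder()
stored = id(h.Y)           # the key the manager would record
h.Y = (1, 2, 3, 4, 5, 6)   # "updating" Y rebinds it to a brand-new object
assert id(h.Y) != stored   # lookup by the stored id now fails
```

The id recorded at declaration time identifies the old object, not the attribute, so any rebinding breaks the lookup.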

Example 1:

from functools import lru_cache, cache

class A():
    def __init__(self):
        self._X = 0
        self.Y = (1, 2, 3)

    @classmethod
    def set_up(cls, z):
        cls.Z = z
    
    @property
    @cache
    def X(self):
        return self._X

    @X.setter
    def X(self, new_X):
        self._X = new_X
        # A.X.fget.cache_clear()
    
    @cache
    def N(self):
        return len(str(self.X))

    @lru_cache(maxsize=None, typed=True)
    def bar(self, y):
        return self.Y * y

    @cache
    def foobar(self):
        return self.bar(self.X) * self.N()

@lru_cache(maxsize=10, typed=True)
def func(ans):
    ans *= A.Z
    return ans

a = A()
a.set_up(3)

# Case 1
# X is changed, so both X and N() need to be cleared
assert a.X == 0
assert a.N() == 1
a.X = 20
assert a.X == 0
assert a.N() == 1
A.X.fget.cache_clear() # could put this in the property setter
a.N.cache_clear()
assert a.X == 20
assert a.N() == 2

# Case 2
# Y is changed so bar() needs to be cleared
assert a.bar(4) == (1, 2, 3) * 4
a.Y = (1, 2, 3, 4, 5, 6)
assert a.bar(4) == (1, 2, 3) * 4
a.bar.cache_clear()
assert a.bar(4) == (1, 2, 3, 4, 5, 6) * 4

# Case 3
# if either a.X or a.Y changes, one or more of X, N(), bar() and foobar()
#     need to be cleared (I can simplify to all - commented lines)
assert a.foobar() == (1, 2, 3, 4, 5, 6) * 20 * 2
a.X = 500
a.Y = (7, 8, 9)
assert a.foobar() == (1, 2, 3, 4, 5, 6) * 20 * 2
A.X.fget.cache_clear()
a.N.cache_clear()
a.bar.cache_clear()
a.foobar.cache_clear()
assert a.foobar() == (7, 8, 9) * 500 * 3

# Case 4
# if A.Z changes, func() needs to be cleared
assert func(a.foobar()) == ((7, 8, 9) * 500 * 3) * 3
a.set_up(4)
assert func(a.foobar()) == ((7, 8, 9) * 500 * 3) * 3
func.cache_clear()
assert func(a.foobar()) == ((7, 8, 9) * 500 * 3) * 4

Example 2:

from collections import defaultdict

class CacheManager():
    def __init__(self):
        self._caches = defaultdict(list)
    
    def add_dependency(self, dependent, dependee):
        # dependent: that which relies on another for support
        # dependee: that which the dependent relies on
        self._caches[dependee].append(dependent)
    
    def changed(self, dependee):
        for d in dependee:
            for f in self._caches[d]:
                f.cache_clear()

cache_manager = CacheManager()
cache_manager.add_dependency(func, "A.Z")

class B(A):
    def __init__(self, cache_manager):
        super().__init__()
        
        self._cache_manager = cache_manager
        
        self._cache_manager.add_dependency(B.X.fget, "b.X")
        self._cache_manager.add_dependency(self.N, "b.X")
        self._cache_manager.add_dependency(self.bar, "b.Y")
        self._cache_manager.add_dependency(self.foobar, "b.X")
        self._cache_manager.add_dependency(self.foobar, "b.N")        

b = B(cache_manager)

# Case 1
# X is changed, so both X and N() need to be cleared
assert b.X == 0
assert b.N() == 1
b.X = 20
assert b.X == 0
assert b.N() == 1
cache_manager.changed(("b.X",))
assert b.X == 20
assert b.N() == 2

# Case 2
# Y is changed so bar() needs to be cleared
assert b.bar(4) == (1, 2, 3) * 4
b.Y = (1, 2, 3, 4, 5, 6)
assert b.bar(4) == (1, 2, 3) * 4
cache_manager.changed(("b.Y",))
assert b.bar(4) == (1, 2, 3, 4, 5, 6) * 4

# Case 3
# if either b.X or b.Y changes, one or more of X, N(), bar() and foobar()
#     need to be cleared (I can simplify to all - commented lines)
assert b.foobar() == (1, 2, 3, 4, 5, 6) * 20 * 2
b.X = 500
b.Y = (7, 8, 9)
assert b.foobar() == (1, 2, 3, 4, 5, 6) * 20 * 2
cache_manager.changed(("b.X", "b.Y"))
assert b.foobar() == (7, 8, 9) * 500 * 3

# Case 4
# a new instance of A is instantiated so func() needs to be cleared
assert func(b.foobar()) == ((7, 8, 9) * 500 * 3) * 4
a = A()
a.set_up(2)
assert func(b.foobar()) == ((7, 8, 9) * 500 * 3) * 4
cache_manager.changed(("A.Z",))
assert func(b.foobar()) == ((7, 8, 9) * 500 * 3) * 2

Example 3:

class CacheManager():
    def __init__(self):
        self._caches = defaultdict(list)
    
    def add_dependency(self, dependent, dependee):
        # dependent: that which relies on another for support
        # dependee: that which the dependent relies on
        print(f"Adding {dependee} with id {id(dependee)}")
        self._caches[id(dependee)].append(dependent)
    
    def changed(self, dependee):
        print(f"{dependee} changed")
        for d in dependee:
            print(f"for dependee {d}, {self._caches[id(d)]} caches ... ")
            for f in self._caches[id(d)]:
                print(f"... clearing cache of {d} with id {id(d)}")
                f.cache_clear()

cache_manager = CacheManager()
cache_manager.add_dependency(func, A.Z)

class B(A):
    def __init__(self, cache_manager):
        super().__init__()
        
        self._cache_manager = cache_manager
        
        self._cache_manager.add_dependency(B.X.fget, self.X)
        self._cache_manager.add_dependency(self.N, self.X)
        self._cache_manager.add_dependency(self.bar, self.Y)
        self._cache_manager.add_dependency(self.foobar, self.X)
        self._cache_manager.add_dependency(self.foobar, self.N)        

b = B(cache_manager)

# Case 1
# X is changed, so both X and N() need to be cleared
assert b.X == 0
assert b.N() == 1
b.X = 20
assert b.X == 0
assert b.N() == 1
cache_manager.changed((b.X,))
assert b.X == 20
assert b.N() == 2

# Case 2
# Y is changed so bar() needs to be cleared
assert b.bar(4) == (1, 2, 3) * 4
b.Y = (1, 2, 3, 4, 5, 6)
assert b.bar(4) == (1, 2, 3) * 4
cache_manager.changed((b.Y,))
assert b.bar(4) == (1, 2, 3, 4, 5, 6) * 4 # <---- Fails

Solution

  • I think the idea of using the cache decorators on non-idempotent functions is a bit of an abuse of the API. The idea is generally that you can cache the values on the objects because there are no side-effects which would require the outputs to change on subsequent calls.

    In your example, class A provides a cached property X, which relies on the mutable attribute A._X. In this case, rather than caching on the property, a better alternative might be to have the property call an idempotent function that takes A._X as an argument; a mutation of A._X then causes a cache miss and a recalculation of the value:

    from functools import cache
    
    @cache
    def _calculate_x(a_x):
        return a_x
    
    
    class A():
        def __init__(self):
            self._X = 0
    
        @property
        def X(self):
            return _calculate_x(self._X)
    
    
    a = A()
    
    assert a.X == 0
    a._X = 20
    assert a.X == 20
    

    If the use case is simply for testing and clearing the cache, I would probably opt for a simpler approach where each test case can easily clear all caches regardless of which dependencies changed. This ensures that each test runs with a blank slate.

    One way to do this could be to create a wrapper around the cache decorator which stores each cached function in a global list before returning it to the caller, e.g.

    from functools import cache
    
    _caches = []
    
    def application_cache(func):
        cached_func = cache(func)
        _caches.append(cached_func)
        return cached_func
    
    def cache_clear():
        for func_cache in _caches:
            func_cache.cache_clear()
    
    ...
    
    class A:
        def __init__(self):
            self._X = 0
    
        @property
        @application_cache
        def X(self):
            return self._X
    
    
    a = A()
    
    assert a.X == 0
    a._X = 20
    assert a.X == 0
    
    cache_clear()
    assert a.X == 20
    

    This would, however, leak memory for short-lived instances that create caches, since the wrapped methods are never removed from _caches: a reference to a wrapped method held by _caches keeps the associated instance from ever being GC'd. If that is a concern, then in the real production code you could have:

    from functools import cache
    
    def application_cache(func):
        return cache(func)
    

    In your test setup you would patch application_cache to use the previous implementation with the global _caches list. This lets you use the plain cache in production, while still providing a cache_clear that globally clears all cached data from within your tests.
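
    Since the decorator runs when the decorated code is defined (i.e. at import time), the patch has to be in place before the module under test is imported. A minimal self-contained sketch of this pattern, where the caching namespace, tracking_cache, and slow are all hypothetical stand-ins for your real module and functions:

    ```python
    from functools import cache
    from types import SimpleNamespace
    from unittest import mock

    # Hypothetical stand-in for the production module that defines
    # application_cache; in real code this would be an imported module.
    caching = SimpleNamespace(application_cache=lambda func: cache(func))

    _caches = []

    def tracking_cache(func):
        # test implementation: same as cache, but remembers the wrapper
        cached = cache(func)
        _caches.append(cached)
        return cached

    def clear_all_caches():
        for cached in _caches:
            cached.cache_clear()

    # The patch must be active while the decorated code is being defined,
    # because the decorator runs at definition time, not call time.
    with mock.patch.object(caching, "application_cache", tracking_cache):
        @caching.application_cache
        def slow(x):
            return x * 2

    assert slow(3) == 6   # cached via the tracking implementation
    clear_all_caches()    # clearable globally from the test
    ```

    In a real test suite the patch target would be the dotted path to application_cache in your caching module, applied in a fixture or setUp before importing the modules that use the decorator.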