pythonlistconditional-formattingpruning

Python: How to efficiently select a subset of a list of object instances based on conditions on their attributes?


The minimal demonstrative example what I am trying to do is as follows: Suppose I have a list of instances of a point class, resembling points in the xy-plane, defined by:

class point:
    def __init__(self,x_value, y_value):
        self.x = x_value
        self.y = y_value

r1 = point(0,1)
r2 = point(2,1)
r3 = point(5,6)

all_points = [r1,r2,r3]

Now I want to construct a subset of this list, which only contains those elements which satisfy certain conditions on self.x and self.y, say self.x < 4 and self.y < 4, so the result of the process should give me the list subset = [r1,r2]. Side-note: The order is actually not important, and the elements are unique, so I could in principle also deal with sets instead of lists – not sure if that could be beneficial.

(EDIT: The below is already an improvement over my original code, based on the answers given so far. But I would like to know if I can make it even faster.) What I currently do is

subset = [r for r in all_points if r.x < 4 and r.y < 4]

Which works, but since in my actual case this for loop goes over 60000 entries, and I have to do this pruning many many times, it ends up taking longer than I can afford it to take for my application.

So my question is: Is there a way to prune the list faster? Thanks!


Solution

  • To illustrate the benefits of "preparing" indexing structures, a sorted list of values can be used to quickly get a sub-range of objects that meet a first criterion, using binary search, yielding a smaller set that you can then intersect with other criteria. Depending on the nature of the attributes and the type of criterion (ranges, individual or multiple values) binary searches or hash indexing can produce great preformance boosts:

    Here is an example using a binary search to select eligible points on the x attribute which are then filtered sequentially on the y criterion:

    from random import randrange
    class point:
        def __init__(self,x_value, y_value):
            self.x = x_value
            self.y = y_value
            
    points = [point(randrange(100),randrange(100)) for _ in range(60000)]
    
    xOrder = sorted(points,key = lambda p:p.x) # points in `x` order
    xs     = [p.x for p in xOrder]             # x values for binary search
    
    from bisect import bisect_left
    
    seqFilter = lambda: [p for p in points if p.x<4 and p.y<4]
    binFilter = lambda: [p for p in xOrder[:bisect_left(xs,4)+1] if p.y < 4]
    

    output:

    from timeit import timeit
    
    print("same output:    ", set(seqFilter())==set(binFilter())) 
    print("sequential time:", timeit(seqFilter,number=1000))
    print("binsect time:   ", timeit(binFilter,number=1000))
    
    same output:    True
    sequential time 2.55
    binsect time    0.18  # 14 times faster
    

    Note: this performance improvement is conditioned on the distinctiveness of the first criteria. In my sample data x<4 represents roughly 1/25th of the points to check. Using the most restrictive criterion as the main filter, with the others applied sequentially, may be sufficient depending on your data and criteria values.

    For criteria that don't involve a lot of different values of a given attribute (e.g. x==4 and y==4), set intersections may also be interesting to look into:

    xDict = {n:set() for n in range(101)}
    yDict = {n:set() for n in range(101)}
    for p in points:
        xDict[p.x].add(p)
        yDict[p.y].add(p)
    
    seqFilter  = lambda: [p for p in points if p.x == 4 and p.y == 4]
    binFilter  = lambda: [p for p in xOrder[bisect_left(xs,4):bisect_right(xs,4)] 
                          if p.y == 4]
    dictFilter = lambda: xDict[4] & yDict[4] 
    

    output:

    print("same output:    ", set(seqFilter())==dictFilter())
    print("sequential time ", timeit(seqFilter,number=1000))
    print("binsect time    ", timeit(binFilter,number=1000))
    print("dictFilter time ", timeit(dictFilter,number=1000))
    
    same output:    True
    sequential time 2.458
    binsect time    0.0388   # 63 times faster
    dictFilter time 0.00623  # 394 times faster
    

    Yet another way to improve performance is to use numpy as the filtering mechanism. numpy is generally faster than list comprehensions and doesn't require sorting. Another benefit of this approach is that it is not data dependent and will give very regular response time, for a given data set sise, independently of the distribution of attribute values:

    import numpy as np
    
    pts   = np.array(points)
    npxs  = np.array([p.x for p in points])
    npys  = np.array([p.y for p in points])
    
    seqFilter  = lambda: [p for p in points if p.x < 4 and p.y < 4]
    binFilter  = lambda: [p for p in xOrder[:bisect_left(xs,4)+1] if p.y < 4]
    npFilter   = lambda: pts[(npxs<4)&(npys<4)]
    

    output:

    print("same output:   ", set(seqFilter())==set(npFilter()))
    print("sequential time", timeit(seqFilter,number=1000))
    print("binsect time   ", timeit(binFilter,number=1000))
    print("npFilter time  ", timeit(npFilter,number=1000))
    
    same output:    True
    sequential time 2.55
    binsect time    0.18  # 14 times faster
    npFilter time   0.07  # 36 times faster