pythonarraysperformancerecursionarray-merge

Python code execution too slow due to bottleneck - seeking performance optimization


I'm currently facing a performance issue with my Python code, specifically with the execution speed. I've identified a bottleneck in my code, and I'm seeking advice on how to optimize it for better performance.

Here's a simplified version of the code:

import numpy as np

def merge_arrays(arr: list, new: list) -> list:
  if len(arr) != len(new):
    raise ValueError(f'Length of <inds> is {len(arr)} but length of <new> is {len(new)}')
  else:
    return transform_array(zip(arr, new), [])

def transform_array(arr: list, r: list):
  for x in arr:
    if isinstance(x, (list, tuple, np.ndarray)):
        transform_array(x, r)
    else:
        r.append(x)
  return r

COUNT = 100000

pips = [(np.arange(5), np.arange(5)) for _ in range(COUNT)]
b = [np.arange(50, 65) for _ in range(COUNT)]
ang = [np.arange(50, 55) for _ in range(COUNT)]
dist = [np.arange(50, 55) for _ in range(COUNT)]

result = [merge_arrays(y, [np.append(b[i][1:], [dist[i][p], ang[i][p]]) for p, i in enumerate(x)]) for x, y in pips]

The issue arises when dealing with large datasets, particularly when the input lists contain a significant number of elements. This leads to slow execution times, which is undesirable for my application.

I've tried to optimize the code by using list comprehensions and avoiding unnecessary condition checks, but the performance is still not satisfactory.

I'm looking for suggestions on how to refactor or optimize this code to improve its performance, especially in the transform_array function where the bottleneck seems to be located.

Any insights or alternative approaches would be greatly appreciated. Thank you!


Solution

  • Reducing the overhead of isinstance

    One problem is that the code makes 9_500_000 calls to isinstance. This is far to much for an interpreted code. Not to mention type introspection is never fast. If we can make the assumption that some items have a well known type and do not need to be tested, then the code can be improved.

    There is a solution for Numpy array since they are typed as opposed to CPython tuples and CPython lists. Note that storing (list, tuple, np.ndarray) in a variable make this 10% faster. Here is a faster code based on this:

    T1 = (list, tuple)
    T2 = (np.ndarray,)
    
    def transform_array(arr: list, r: list):
      for x in arr:
        if isinstance(x, T1):
            transform_array(x, r)
        elif isinstance(x, T2):
            if x.dtype == object:
                # Generic case
                transform_array(x, r)
            else:
                # No need to check all items: 
                # Push back all the items of the array in the list
                r.extend(x.tolist())
        else:
            r.append(x)
      return r
    

    Once this optimization is done, >95% of the time is spent in the result = ... list comprehension and not even in merge_arrays calls. 50% of this time is spent in np.append.


    Reducing the overhead of Numpy calls

    Numpy calls in the list comprehension are quite expensive. This is mainly because Numpy is not designed to do operations on very small array nor to extract many scalar items manually. On top of that np.append is designed to append arrays (and not lists) to arrays so it first converts the list to an array which is also quite expensive (not to mention it needs to check the types of all the items in it).

    We can reduce the overhead by doing the operation manually. Here is a faster implementation:

    #import numba as nb
    #@nb.njit(['(int32[:], int32[:], int32[:], int64)', '(int64[:], int64[:], int64[:], int64)'])
    def build_item(b_i, dist_i, ang_i, p):
        out = np.empty(b_i.size+1, dtype=b_i.dtype)
        out[:-2] = b_i[1:]
        out[-2] = dist_i[p]
        out[-1] = ang_i[p]
        return out
    
    # [...]
    
    result = [merge_arrays(y, [build_item(b[i], dist[i], ang[i], p) for p, i in enumerate(x)]) for x, y in pips]
    

    The line result = ... takes 1.66 seconds on my machine as opposed to 5.92 second for the initial code. It is thus 3.6 times faster.