pythonpandasperformancenumexpr

Why is Pandas.eval() with numexpr so slow?


Test code:

import numpy as np
import pandas as pd

COUNT = 1000000

df = pd.DataFrame({
    'y': np.random.normal(0, 1, COUNT),
    'z': np.random.gamma(50, 1, COUNT),
})

%timeit df.y[(10 < df.z) & (df.z < 50)].mean()
%timeit df.y.values[(10 < df.z.values) & (df.z.values < 50)].mean()
%timeit df.eval('y[(10 < z) & (z < 50)].mean()', engine='numexpr')

The output on my machine (a fairly fast x86-64 Linux desktop with Python 3.6) is:

17.8 ms ±  1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.44 ms ±  502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.4 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs,  10 loops each)

I understand why the second line is a bit faster (it ignores the Pandas index). But why is the eval() approach using numexpr so slow? Shouldn't it be faster than at least the first approach? The documentation sure makes it seem like it would be: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html


Solution

  • From the investigation presented below, it looks like the unspectacular reason for the worse performance is "overhead".

    Only a small part of the expression y[(10 < z) & (z < 50)].mean() is done via numexpr-module. numexpr doesn't support indexing, thus we can only hope for (10 < z) & (z < 50) to be speed-up - anything else will be mapped to pandas-operations.

    However, (10 < z) & (z < 50) is not the bottle-neck here, as can be easily seen:

    %timeit df.y[(10 < df.z) & (df.z < 50)].mean()  # 16.7 ms
    mask=(10 < df.z) & (df.z < 50)                  
    %timeit df.y[mask].mean()                       # 13.7 ms
    %timeit df.y[mask]                              # 13.2 ms
    

    df.y[mask] -takes the lion's share of the running time.

    We can compare the profiler output for df.y[mask] and df.eval('y[mask]') to see what makes the difference.

    When I use the following script:

    import numpy as np
    import pandas as pd
    
    COUNT = 1000000
    
    df = pd.DataFrame({
        'y': np.random.normal(0, 1, COUNT),
        'z': np.random.gamma(50, 1, COUNT),
    })
    
    mask = (10 < df.z) & (df.z < 50)
    df['m']=mask
    
    for _ in range(500):
       df.y[df.m] 
       # OR 
       #df.eval('y[m]', engine='numexpr')
    

    and run it with python -m cProfile -s cumulative run.py (or %prun -s cumulative <...> in IPython), I can see the following profiles.

    For direct call of the pandas functionality:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        419/1    0.013    0.000    7.228    7.228 {built-in method builtins.exec}
            1    0.006    0.006    7.228    7.228 run.py:1(<module>)
          500    0.005    0.000    6.589    0.013 series.py:764(__getitem__)
          500    0.003    0.000    6.475    0.013 series.py:812(_get_with)
          500    0.003    0.000    6.468    0.013 series.py:875(_get_values)
          500    0.009    0.000    6.445    0.013 internals.py:4702(get_slice)
          500    0.006    0.000    3.246    0.006 range.py:491(__getitem__)
          505    3.146    0.006    3.236    0.006 base.py:2067(__getitem__)
          500    3.170    0.006    3.170    0.006 internals.py:310(_slice)
        635/2    0.003    0.000    0.414    0.207 <frozen importlib._bootstrap>:958(_find_and_load)
    

    We can see that almost 100% of the time is spent in series.__getitem__ without any overhead.

    For the call via df.eval(...), the situation is quite different:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        453/1    0.013    0.000   12.702   12.702 {built-in method builtins.exec}
            1    0.015    0.015   12.702   12.702 run.py:1(<module>)
          500    0.013    0.000   12.090    0.024 frame.py:2861(eval)
     1000/500    0.025    0.000   10.319    0.021 eval.py:153(eval)
     1000/500    0.007    0.000    9.247    0.018 expr.py:731(__init__)
     1000/500    0.004    0.000    9.236    0.018 expr.py:754(parse)
     4500/500    0.019    0.000    9.233    0.018 expr.py:307(visit)
     1000/500    0.003    0.000    9.105    0.018 expr.py:323(visit_Module)
     1000/500    0.002    0.000    9.102    0.018 expr.py:329(visit_Expr)
          500    0.011    0.000    9.096    0.018 expr.py:461(visit_Subscript)
          500    0.007    0.000    6.874    0.014 series.py:764(__getitem__)
          500    0.003    0.000    6.748    0.013 series.py:812(_get_with)
          500    0.004    0.000    6.742    0.013 series.py:875(_get_values)
          500    0.009    0.000    6.717    0.013 internals.py:4702(get_slice)
          500    0.006    0.000    3.404    0.007 range.py:491(__getitem__)
          506    3.289    0.007    3.391    0.007 base.py:2067(__getitem__)
          500    3.282    0.007    3.282    0.007 internals.py:310(_slice)
          500    0.003    0.000    1.730    0.003 generic.py:432(_get_index_resolvers)
         1000    0.014    0.000    1.725    0.002 generic.py:402(_get_axis_resolvers)
         2000    0.018    0.000    1.685    0.001 base.py:1179(to_series)
         1000    0.003    0.000    1.537    0.002 scope.py:21(_ensure_scope)
         1000    0.014    0.000    1.534    0.002 scope.py:102(__init__)
          500    0.005    0.000    1.476    0.003 scope.py:242(update)
          500    0.002    0.000    1.451    0.003 inspect.py:1489(stack)
          500    0.021    0.000    1.449    0.003 inspect.py:1461(getouterframes)
        11000    0.062    0.000    1.415    0.000 inspect.py:1422(getframeinfo)
         2000    0.008    0.000    1.276    0.001 base.py:1253(_to_embed)
         2035    1.261    0.001    1.261    0.001 {method 'copy' of 'numpy.ndarray' objects}
         1000    0.015    0.000    1.226    0.001 engines.py:61(evaluate)
        11000    0.081    0.000    1.081    0.000 inspect.py:757(findsource)
    

    once again about 7 seconds are spent in series.__getitem__, but there are also about 6 seconds overhead - for example about 2 seconds in frame.py:2861(eval) and about 2 seconds in expr.py:461(visit_Subscript).

    I did only a superficial investigation (see more details further below), but this overhead doesn't seems to be just constant but at least linear in the number of element in the series. For example there is method 'copy' of 'numpy.ndarray' objects which means that data is copied (it is quite unclear, why this would be necessary per se).

    My take-away from it: using pd.eval has advantages as long as the evaluated expression can be evaluated with numexpr alone. As soon as this is not the case, there might be no longer gains but losses due to quite large overhead.


    Using line_profiler (here I use %lprun-magic (after loading it with %load_ext line_profliler) for the function run() which is more or less a copy from the script above) we can easily find where the time is lost in Frame.eval:

    %lprun -f pd.core.frame.DataFrame.eval
           -f pd.core.frame.DataFrame._get_index_resolvers 
           -f pd.core.frame.DataFrame._get_axis_resolvers  
           -f pd.core.indexes.base.Index.to_series 
           -f pd.core.indexes.base.Index._to_embed
           run()
    

    Here we can see were the additional 10% are spent:

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      2861                                               def eval(self, expr, 
    ....
      2951        10        206.0     20.6      0.0          from pandas.core.computation.eval import eval as _eval
      2952                                           
      2953        10        176.0     17.6      0.0          inplace = validate_bool_kwarg(inplace, 'inplace')
      2954        10         30.0      3.0      0.0          resolvers = kwargs.pop('resolvers', None)
      2955        10         37.0      3.7      0.0          kwargs['level'] = kwargs.pop('level', 0) + 1
      2956        10         17.0      1.7      0.0          if resolvers is None:
      2957        10     235850.0  23585.0      9.0              index_resolvers = self._get_index_resolvers()
      2958        10       2231.0    223.1      0.1              resolvers = dict(self.iteritems()), index_resolvers
      2959        10         29.0      2.9      0.0          if 'target' not in kwargs:
      2960        10         19.0      1.9      0.0              kwargs['target'] = self
      2961        10         46.0      4.6      0.0          kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
      2962        10    2392725.0 239272.5     90.9          return _eval(expr, inplace=inplace, **kwargs)
    

    and _get_index_resolvers() can be drilled down to Index._to_embed:

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1253                                               def _to_embed(self, keep_tz=False, dtype=None):
      1254                                                   """
      1255                                                   *this is an internal non-public method*
      1256                                           
      1257                                                   return an array repr of this object, potentially casting to object
      1258                                           
      1259                                                   """
      1260        40         73.0      1.8      0.0          if dtype is not None:
      1261                                                       return self.astype(dtype)._to_embed(keep_tz=keep_tz)
      1262                                           
      1263        40     201490.0   5037.2    100.0          return self.values.copy()
    

    Where the O(n)-copying happens.