
pandas correlation is slower on PyPy than on Python


I wanted to try PyPy for pandas operations, thinking some parts of my code might run faster on it, but apparently it is slower than regular Python.

What is the reason behind that?

Here is my code sample; it just reads example data from a CSV file and computes the correlation.

With Python: 7 minutes
With PyPy: 8.5 minutes

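import time

import pandas as pd

# Time the whole run: CSV load plus correlation.
t = time.perf_counter()

df = pd.read_csv('./dfn.csv', index_col=0)

# Transpose so that corr() correlates the original rows with each other.
df.T.corr()

print(time.perf_counter() - t)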

Solution

  • Much of the scientific Python software stack is actually written in C/C++. So when you use pandas routines like read_csv or T.corr(), you are not hitting Python code but compiled code, and PyPy cannot speed that code up much. Additionally, the interfaces to the C/C++ code are currently written using the CPython C-API. For PyPy to use that code, it must emulate the CPython C-API, which is slow. See this blog post for the reasons. We hope HPy will change that situation and make C/C++ interop faster on PyPy (and other Python implementations).
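
To make the contrast concrete, here is a minimal sketch of the kind of code PyPy's JIT can help with: a correlation written entirely in Python, using only the standard library. The names pearson and timed are hypothetical helpers invented for this illustration; they are not part of pandas or PyPy, and this is not how pandas computes corr(). Running the same file under both interpreters should show the difference on pure-Python loops, whereas the pandas version in the question spends its time in compiled code either way.

```python
import math
import platform
import time


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, in pure Python."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # These generator loops run in the interpreter, so a JIT can optimize them.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Synthetic data, chosen only for illustration.
    xs = [float(i) for i in range(200_000)]
    ys = [2.0 * x + 1.0 for x in xs]
    r, seconds = timed(pearson, xs, ys)
    # Reports "CPython" or "PyPy" so runs can be compared side by side.
    print(platform.python_implementation(), f"r={r:.6f}", f"{seconds:.3f}s")
```

Timings will vary by machine and interpreter version, so treat any single run as a rough indication only.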