
Pandas replace/dictionary slowness


Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:

# Series has 200 rows and 1 column
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)

Dictionary lookups should be O(1), and replacing a single value in a column should be O(1) as well. Isn't this a vectorized operation? Even if it isn't, iterating over 200 rows is only 200 iterations, so how can it be slow?

Here is an SSCCE demonstrating the issue:

import pandas as pd
import random

# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)

# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')

Running that command takes more than 1 second on my machine, which is thousands of times longer than I would expect for fewer than 1000 operations.


Solution

  • It looks like replace carries a lot of overhead here, and explicitly telling the Series what to do via map yields the best performance:

    series = series.map(lambda x: dictionary.get(x,x))
    

    If you're sure that every value in the series is a key in your dictionary, you can get a very slight performance boost by not creating a lambda and supplying the dictionary.get function directly. Any value that is not a key comes back as a missing value (None/NaN) with this method instead of being left unchanged, so beware:

    series = series.map(dictionary.get)
    

    You can also supply just the dictionary itself, but in the timings below this is noticeably slower than passing dictionary.get:

    series = series.map(dictionary)
    

    Timings

    Some timing comparisons using your example data:

    %timeit series.map(dictionary.get)
    10000 loops, best of 3: 124 µs per loop
    
    %timeit series.map(lambda x: dictionary.get(x,x))
    10000 loops, best of 3: 150 µs per loop
    
    %timeit series.map(dictionary)
    100 loops, best of 3: 5.45 ms per loop
    
    %timeit series.replace(dictionary)
    1 loop, best of 3: 1.23 s per loop
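
For anyone who wants to check this outside IPython, here is a rough sketch that rebuilds the example data, shows how each map variant treats a value that is not in the dictionary, and times all four approaches with the standard timeit module. The fixed seed, the probe value -1 and the best_time helper are my own additions for illustration, not part of the original code, and the numbers you get will depend on your machine and pandas version.

```python
import random
import timeit

import pandas as pd

# Same shape of data as the question: 11270 dictionary entries, 200 rows.
dictionary = {x: 'Some string ' + str(x) for x in range(11270)}
random.seed(0)  # assumption: fixed seed so reruns use the same values
series = pd.Series([random.randint(1, 11269) for _ in range(200)])

# When every value is a key, the three map variants produce the same strings.
via_lambda = series.map(lambda x: dictionary.get(x, x))
via_get = series.map(dictionary.get)
via_dict = series.map(dictionary)
assert via_lambda.tolist() == via_get.tolist() == via_dict.tolist()

# Append one value that is NOT a key (-1, chosen arbitrarily) to compare
# how each variant handles it.
with_missing = pd.concat([series, pd.Series([-1])], ignore_index=True)
kept = with_missing.map(lambda x: dictionary.get(x, x))
dropped_get = with_missing.map(dictionary.get)
dropped_dict = with_missing.map(dictionary)
print('lambda with default:', kept.iloc[-1])                     # original value kept
print('dictionary.get missing:', dropped_get.isna().iloc[-1])    # treated as missing
print('dictionary missing:', dropped_dict.isna().iloc[-1])       # treated as missing


def best_time(func, number, repeat=3):
    """Hypothetical helper: best per-call time in seconds for a callable."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


timings = {
    'series.map(dictionary.get)': best_time(
        lambda: series.map(dictionary.get), 100),
    'series.map(lambda x: dictionary.get(x, x))': best_time(
        lambda: series.map(lambda x: dictionary.get(x, x)), 100),
    'series.map(dictionary)': best_time(
        lambda: series.map(dictionary), 10),
    'series.replace(dictionary)': best_time(
        lambda: series.replace(dictionary), 1),
}
for label, seconds in timings.items():
    print(f'{label:45s} {seconds * 1e6:12.1f} µs per call')
```

If the values that fall outside the dictionary matter to you, the lambda form with a default is the only one of the three map variants that leaves them unchanged.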