Tags: python, pandas, dataframe, performance, iterator

Why is Pandas itertuples slower than iterrows on dataframes with many (>100) columns?


In the unfortunate situation where looping over the rows of a Pandas dataframe is the only way to proceed, it is usually said that itertuples() should be preferred over iterrows() for speed. This holds for dataframes with few columns ('narrow' dataframes), but seems to reverse for wide dataframes with hundreds of columns.

Is this normal? Shouldn't itertuples behave like iterrows in terms of scaling with column count, i.e. constant rather than linear time?

Attached below is a code snippet showing the cross-over from itertuples() to iterrows() being the fastest way to iterate as the dataframe width increases.

import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

size_range = [100, 300, 600, 1200]
nrows = 100000
compute_time_rows = np.zeros(len(size_range))
compute_time_tuples = np.zeros(len(size_range))

for s, size in enumerate(size_range):
    x = pd.DataFrame(np.random.randn(nrows, size))
    start = time.time()
    for idx, row in x.iterrows():
        z = sum(row)
    stop = time.time()
    compute_time_rows[s] = stop - start
    start = time.time()
    for row in x.itertuples(index=False):
        z = sum(row)
    stop = time.time()
    compute_time_tuples[s] = stop - start

pd.Series(compute_time_rows, index=size_range).plot(grid=True, label='iterrows')
pd.Series(compute_time_tuples, index=size_range).plot(grid=True, label='itertuples')
plt.xlabel('Dataframe width')
plt.ylabel('Computation time [s]')
plt.title(f'Iteration over a {nrows}-row Pandas dataframe')
plt.legend()
plt.show()

[Row iteration speed vs dataframe width] https://i.sstatic.net/65QuPfjB.png


Solution

  • First of all, you should not use either of them. Iterating over a dataframe in a Python loop is painfully slow no matter how you do it; use these methods only if performance does not matter. That said, I will answer the question out of technical curiosity.
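To make the "don't iterate" point concrete, here is a minimal sketch using a hypothetical row-sum task: the loop-based approach recomputes what df.sum(axis=1) already does in compiled code.

```python
import numpy as np
import pandas as pd

# Hypothetical task: per-row sums of a numeric dataframe.
df = pd.DataFrame(np.random.randn(1000, 50))

# Python-level loop, as with iterrows/itertuples:
loop_sums = [sum(row) for row in df.itertuples(index=False)]

# Vectorized equivalent, executed in compiled code:
vec_sums = df.sum(axis=1)

assert np.allclose(loop_sums, vec_sums)
```

On wide or long frames the vectorized version is typically orders of magnitude faster than either iterator.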

    Both APIs are helper functions implemented in Python, so you can have a look at the source.

    The following are the relevant parts, simplified. You can check that they show the same scaling pattern as the originals if you want to.

    import pandas as pd
    
    def iterrows(df: pd.DataFrame):
        for row in df.values:
            yield pd.Series(row)
    
    
    def itertuples(df: pd.DataFrame):
        cols = [df.iloc[:, k] for k in range(len(df.columns))]
        return zip(*cols)
    

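As a quick sanity check (the function names here are illustrative, not pandas API), the simplified versions produce the same values as the real iterators:

```python
import numpy as np
import pandas as pd

# Simplified reimplementations, as sketched above (hypothetical names).
def iterrows_simplified(df: pd.DataFrame):
    for row in df.values:
        yield pd.Series(row)

def itertuples_simplified(df: pd.DataFrame):
    cols = [df.iloc[:, k] for k in range(len(df.columns))]
    return zip(*cols)

df = pd.DataFrame(np.random.randn(4, 3))

# Same values as pandas' own iterators (ignoring index and field names).
assert [tuple(t) for t in itertuples_simplified(df)] == \
       [tuple(t) for t in df.itertuples(index=False)]

for simple, (_, real) in zip(iterrows_simplified(df), df.iterrows()):
    assert (simple.values == real.values).all()
```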
    From the above code, we can find two major factors.

    The first is that iterrows does not make copies (in this case). df.values returns a single numpy array; when you iterate over it, each row is a view of the original array, not a copy. And when you pass a numpy array to pd.Series, it is used as the internal buffer, so the Series is also a view. Therefore iterrows yields views of the dataframe. itertuples, on the other hand, creates views of the columns first, but leaves the rest to Python's built-in zip, which copies the values into a new tuple for every row.
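The view/copy claim can be checked at the numpy level (pandas' exact copy semantics vary between versions, so this sticks to numpy facts):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3))

# On a single-dtype frame, df.values is one 2-D block; each row of it is a view.
arr = df.values
row = arr[0]
assert np.shares_memory(arr, row)

# zip, by contrast, materializes a brand-new tuple per row.
first = next(zip(*[df.iloc[:, k] for k in range(df.shape[1])]))
assert isinstance(first, tuple)
```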

    Second, since zipping columns means creating one iterator per column, the per-row iteration overhead is multiplied by the number of columns (compared to iterrows). So itertuples is not only linear in width, it is also simply slow. To be fair, I don't think itertuples was designed to handle such a large number of columns, considering how much faster it is for small column counts.
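The multiplied overhead can be made visible with a small counting iterator (a contrived example, not pandas code): zip advances every column iterator once per row and allocates a fresh tuple each time.

```python
class Counting:
    """Iterator over range(n) that records how often it is advanced."""
    def __init__(self, n):
        self.it = iter(range(n))
        self.steps = 0
    def __iter__(self):
        return self
    def __next__(self):
        self.steps += 1
        return next(self.it)

ncols, nrows = 5, 3
cols = [Counting(nrows) for _ in range(ncols)]
rows = list(zip(*cols))

assert len(rows) == nrows
# Every row advanced every column iterator: at least ncols * nrows next() calls in total.
assert sum(c.steps for c in cols) >= ncols * nrows
```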

    To confirm this hypothesis, here are two examples. The first makes iterrows yield a tuple instead of a pd.Series. This forces a copy, which makes iterrows linear as well.

    def iterrows_2(df: pd.DataFrame):
        for row in df.values:
            # yield pd.Series(row)
            yield tuple(row)
    

    [Plot: iterrows_2 vs the others]

    With this simple change, the performance characteristics of iterrows become very similar to those of itertuples (though somehow still faster).

    The second example makes itertuples convert the columns to a single numpy array instead of using zip.

    def itertuples_2(df: pd.DataFrame):
        cols = [df.iloc[:, k] for k in range(len(df.columns))]
        # return zip(*cols)
        return np.array(cols).T  # This is only possible for a single-dtype dataframe.
    

    [Plot: itertuples_2 vs the others]

    Both creating cols and converting it to a numpy array take linear time, so the whole thing is still linear overall, but much faster. In other words, this shows how slow zip(*cols) is.
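An illustrative micro-timing of that gap (the sizes are arbitrary and the absolute figures depend on the machine):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 300))
cols = [df.iloc[:, k] for k in range(df.shape[1])]

# Materialize all rows both ways and compare wall time.
t_zip = timeit.timeit(lambda: list(zip(*cols)), number=1)
t_arr = timeit.timeit(lambda: list(np.array(cols).T), number=1)
print(f"zip(*cols): {t_zip:.3f}s   np.array(cols).T: {t_arr:.3f}s")
```

On a typical machine the zip-based version is slower by well over an order of magnitude.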

    Here is the benchmark code:

    import time
    
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    
    
    def iterrows(df: pd.DataFrame):
        for row in df.values:
            yield pd.Series(row)
    
    
    def iterrows_2(df: pd.DataFrame):
        for row in df.values:
            # yield pd.Series(row)
            yield tuple(row)
    
    
    def itertuples(df: pd.DataFrame):
        cols = [df.iloc[:, k] for k in range(len(df.columns))]
        return zip(*cols)
    
    
    def itertuples_2(df: pd.DataFrame):
        cols = [df.iloc[:, k] for k in range(len(df.columns))]
        # return zip(*cols)
        return np.array(cols).T  # This is only possible for the single dtype dataframe.
    
    
    def benchmark(candidates):
        size_range = [100, 300, 600, 900, 1200]
        nrows = 100000
    
        times = {k: [] for k in candidates}
        for size in size_range:
            for func in candidates:
                x = pd.DataFrame(np.random.randn(nrows, size))
                start = time.perf_counter()
                for row in func(x):
                    pass
                stop = time.perf_counter()
                times[func].append(stop - start)
    
        # Plot
        plt.figure()
        for func in candidates:
            s = pd.Series(times[func], index=size_range, name=func.__name__)
            plt.plot(s.index, s.values, marker="o", label=s.name)
        plt.title(f"iterrows vs itertuples ({nrows} rows)")
        plt.xlabel("Columns")
        plt.ylabel("Time [s]")
        plt.legend()
        plt.grid(True)
        plt.tight_layout()
        plt.show()
    
    
    benchmark([iterrows, itertuples, iterrows_2])
    benchmark([iterrows, itertuples, itertuples_2])
    

    FYI, iterrows yields views in your benchmark, but that is not always the case. For example, if you convert the first column to a string type, df.values has to make a copy (the frame is no longer single-dtype), and iterrows also takes linear time.

    x = pd.DataFrame(np.random.randn(nrows, size))
    x[0] = x[0].astype(str)
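A quick way to see that copy kick in: with mixed dtypes, df.values must consolidate everything into a single object array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3))
assert df.values.dtype == np.float64   # one homogeneous float block

df[0] = df[0].astype(str)              # first column is now strings
assert df.values.dtype == object       # consolidated into a fresh object array
```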