In the unfortunate situation where looping over the rows of a Pandas dataframe is the only way to proceed, it's usually mentioned that itertuples() is to be preferred over iterrows() in terms of computational speed. This assertion appears valid for dataframes with few columns ('narrow' dataframes), but doesn't seem to hold for wide dataframes with hundreds of columns.
Is this normal? Shouldn't itertuples behave like iterrows in terms of scaling with column number, i.e. constant time rather than linear time?
Attached below is the code snippet showing the cross-over from itertuples() to iterrows() being the fastest way to iterate as the dataframe width increases.
import pandas as pd
import time
from pylab import *

size_range = [100, 300, 600, 1200]
nrows = 100000
compute_time_rows = zeros(len(size_range))
compute_time_tuples = zeros(len(size_range))

for s, size in enumerate(size_range):
    x = pd.DataFrame(randn(nrows, size))

    start = time.time()
    for idx, row in x.iterrows():
        z = sum(row)
    stop = time.time()
    compute_time_rows[s] = stop - start

    start = time.time()
    for row in x.itertuples(index=False):
        z = sum(row)
    stop = time.time()
    compute_time_tuples[s] = stop - start

xlabel('Dataframe width')
ylabel('Computation time [s]')
pd.Series(compute_time_rows, index=size_range).plot(grid=True, label='iterrows')
pd.Series(compute_time_tuples, index=size_range).plot(grid=True, label='itertuples')
title(f'Iteration over a {nrows} rows Pandas dataframe')
legend()
[Row iteration speed vs dataframe width] https://i.sstatic.net/65QuPfjB.png
First of all, you should not use either of them. Iterating through a dataframe in a Python loop is painfully slow no matter how you do it, so both are only acceptable when you don't care about performance. That being said, I will answer your question out of technical curiosity.
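For reference, the row loop in the question (z = sum(row)) can be replaced by a single vectorized call; a minimal sketch, with sizes chosen only for illustration:

import numpy as np
import pandas as pd

nrows, ncols = 100000, 300  # illustrative sizes
x = pd.DataFrame(np.random.randn(nrows, ncols))

# One call executed in compiled code instead of 100000 Python-level iterations:
z = x.sum(axis=1)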
Both APIs are helper functions implemented in Python, so you can have a look at the source.
The following are the relevant parts, stripped down; you can check that they show the same performance pattern as the originals if you want to.
import pandas as pd

def iterrows(df: pd.DataFrame):
    for row in df.values:
        yield pd.Series(row)

def itertuples(df: pd.DataFrame):
    cols = [df.iloc[:, k] for k in range(len(df.columns))]
    return zip(*cols)
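As a quick sanity check (illustrative only; it relies on the two definitions above and on the default RangeIndex labels), the simplified generators can be compared against the real methods on a small single-dtype frame:

import numpy as np

df = pd.DataFrame(np.random.randn(3, 4))

# Values should match row by row for both simplified implementations.
for ours, (idx, theirs) in zip(iterrows(df), df.iterrows()):
    assert (ours == theirs).all()
for ours, theirs in zip(itertuples(df), df.itertuples(index=False)):
    assert tuple(ours) == tuple(theirs)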
From the above code, we can identify two major factors.
The first is that iterrows does not make copies (in this case). df.values returns a single numpy array, and when you iterate over it, each row is a view, not a copy, of the original array. And when you pass a numpy array to pd.Series, it is used as the internal buffer, so the resulting Series is a view as well. Therefore, iterrows yields views of the dataframe.
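You can verify this chain of views with np.shares_memory; a minimal check, assuming a single-dtype frame and that pd.Series wraps the given ndarray without copying (which holds for classic NumPy-backed pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3))
arr = df.values          # one 2-D array backing the whole frame
row = arr[0]             # indexing a 2-D array yields a view of that row
s = pd.Series(row)       # the Series adopts the same buffer

print(np.shares_memory(arr, row))       # True: the row is a view
print(np.shares_memory(row, s.values))  # True: no copy was made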
On the other hand, itertuples creates views of the columns first, but leaves the rest to Python's builtin zip function, which makes a copy: every yielded row is a freshly built tuple.
Second, since zipping the columns amounts to creating one iterator per column, the per-row iteration overhead is multiplied by the number of columns (compared to iterrows).
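To make the copy from the first point concrete, here is a small illustrative check: each row produced by zip is a freshly built tuple of scalars, detached from the dataframe.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3))
cols = [df.iloc[:, k] for k in range(len(df.columns))]

first_row = next(zip(*cols))  # zip materializes a new tuple per row
df.iloc[0, 0] = 999.0         # mutate the frame afterwards...
print(first_row[0])           # ...the tuple still holds the old value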
So itertuples is not only linear in the number of columns, it is also simply slow.
To be fair, I don't think itertuples was designed to handle such a large number of columns, considering that it is much faster for small numbers of columns.
To confirm this hypothesis, I will show two examples.
The first, in iterrows, yields a tuple instead of a pd.Series. This makes iterrows linear, since it forces a copy.
def iterrows_2(df: pd.DataFrame):
    for row in df.values:
        # yield pd.Series(row)
        yield tuple(row)
With this simple change, the performance characteristics of iterrows become very similar to those of itertuples (though somehow still faster).
The second, in itertuples, converts the columns to a single numpy array instead of using zip.
def itertuples_2(df: pd.DataFrame):
    cols = [df.iloc[:, k] for k in range(len(df.columns))]
    # return zip(*cols)
    return np.array(cols).T  # This is only possible for a single-dtype dataframe.
Both creating cols and converting it to a numpy array take linear time, so it is still linear overall, but much faster. In other words, this shows how slow zip(*cols) is.
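A rough head-to-head of just that step (timings are machine-dependent; the sizes below are chosen arbitrarily):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 600))
cols = [df.iloc[:, k] for k in range(len(df.columns))]

print(timeit.timeit(lambda: list(zip(*cols)), number=1))        # one tuple per row
print(timeit.timeit(lambda: list(np.array(cols).T), number=1))  # one array, row views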
Here is the benchmark code:
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def iterrows(df: pd.DataFrame):
    for row in df.values:
        yield pd.Series(row)

def iterrows_2(df: pd.DataFrame):
    for row in df.values:
        # yield pd.Series(row)
        yield tuple(row)

def itertuples(df: pd.DataFrame):
    cols = [df.iloc[:, k] for k in range(len(df.columns))]
    return zip(*cols)

def itertuples_2(df: pd.DataFrame):
    cols = [df.iloc[:, k] for k in range(len(df.columns))]
    # return zip(*cols)
    return np.array(cols).T  # This is only possible for a single-dtype dataframe.

def benchmark(candidates):
    size_range = [100, 300, 600, 900, 1200]
    nrows = 100000
    times = {k: [] for k in candidates}
    for size in size_range:
        for func in candidates:
            x = pd.DataFrame(np.random.randn(nrows, size))
            start = time.perf_counter()
            for row in func(x):
                pass
            stop = time.perf_counter()
            times[func].append(stop - start)

    # Plot
    plt.figure()
    for func in candidates:
        s = pd.Series(times[func], index=size_range, name=func.__name__)
        plt.plot(s.index, s.values, marker="o", label=s.name)
    plt.title(f"iterrows vs itertuples ({nrows} rows)")
    plt.xlabel("Columns")
    plt.ylabel("Time [s]")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

benchmark([iterrows, itertuples, iterrows_2])
benchmark([iterrows, itertuples, itertuples_2])
FYI, note that iterrows yields views in your benchmark, but this is not always the case. For example, if you convert the first column to a string type, df.values will have to make a copy (a single object-dtype array), and iterrows will take linear time as well:
x = pd.DataFrame(np.random.randn(nrows, size))
x[0] = x[0].astype(str)
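A quick way to see that copy (a sketch; the exact behavior may vary across pandas versions):

import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(5, 3))
print(np.shares_memory(x.values, x.values))  # True: one shared buffer

x[0] = x[0].astype(str)
print(x.values.dtype)                        # object: the dtypes were promoted
print(np.shares_memory(x.values, x.values))  # False: a fresh copy on each call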