Can someone help me understand the difference between the "single" and "table" methods in pandas.DataFrame.quantile? The documentation says only: "Whether to compute quantiles per-column ('single') or over all columns ('table')." https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html
For example, the following code yields the same result for both methods.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=["a", "b"])
print(df.quantile(method="single", interpolation="nearest"))
print(df.quantile(method="table", interpolation="nearest"))
a 3
b 100
Name: 0.5, dtype: int64
a 3
b 100
Name: 0.5, dtype: int64
If you invert the order of the values in b, the results differ:
df['b'] = df['b'].values[::-1]
print(df.quantile(method='single', interpolation='nearest'))
print(df.quantile(method='table', interpolation='nearest'))
Output:
a 3
b 100
Name: 0.5, dtype: int64
a 3
b 10
Name: 0.5, dtype: int64
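A way to see the practical difference (my own sketch, not from the docs): with method='table' the result is always one actual row of the DataFrame, so the values stay consistent across columns, while method='single' may combine values from different rows:

```python
import pandas as pd

# Same data as above, with column b inverted.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [100, 100, 10, 1]})

single = df.quantile(method="single", interpolation="nearest")
table = df.quantile(method="table", interpolation="nearest")

# The 'table' result (a=3, b=10) matches row 2 of df exactly...
print((df == table).all(axis=1).any())   # True
# ...while the 'single' result (a=3, b=100) is not any row of df.
print((df == single).all(axis=1).any())  # False
```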
The exact difference in behavior is a bit tricky to understand. This happens in pandas/core/frame.py:
if method == "single":
    res = data._mgr.quantile(qs=q, interpolation=interpolation)
elif method == "table":
    valid_interpolation = {"nearest", "lower", "higher"}
    if interpolation not in valid_interpolation:
        raise ValueError(
            f"Invalid interpolation: {interpolation}. "
            f"Interpolation must be in {valid_interpolation}"
        )
    # handle degenerate case
    if len(data) == 0:
        if data.ndim == 2:
            dtype = find_common_type(list(self.dtypes))
        else:
            dtype = self.dtype
        return self._constructor([], index=q, columns=data.columns, dtype=dtype)

    q_idx = np.quantile(np.arange(len(data)), q, method=interpolation)

    by = data.columns
    if len(by) > 1:
        keys = [data._get_label_or_level_values(x) for x in by]
        indexer = lexsort_indexer(keys)
    else:
        k = data._get_label_or_level_values(by[0])
        indexer = nargsort(k)

    res = data._mgr.take(indexer[q_idx], verify=False)
    res.axes[1] = q

result = self._constructor_from_mgr(res, axes=res.axes)
return result.__finalize__(self, method="quantile")
In short, method='single' computes the quantile of each column independently:

np.percentile(df.to_numpy(), 50, axis=0, method='nearest')
# array([  3, 100])

With method='table', the quantile is instead computed on the row positions (np.arange(len(df))). The result q_idx is an integer between 0 and len(df) - 1. Then, the rows are sorted with lexsort_indexer, giving priority to the first column(s), and finally the (q_idx)-th row is taken:
data = df
q = 0.5
interpolation = 'nearest'
q_idx = np.quantile(np.arange(len(data)), q, method=interpolation)
# 2
indexer = lexsort_indexer([data._get_label_or_level_values(x) for x in data.columns])
# array([0, 1, 2, 3])
df.iloc[indexer]
# a b
# 0 1 100
# 1 2 100
# 2 3 10 # this row will be picked
# 3 4 1
data.iloc[indexer[q_idx]]
# a 3
# b 10
# Name: 2, dtype: int64
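The same pick can be reproduced with plain pandas (a sketch of the mechanism, not pandas' actual code path): sort by all columns, leftmost first, and take row q_idx:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [100, 100, 10, 1]})

# Quantile of the row positions 0..len(df)-1.
q_idx = int(np.quantile(np.arange(len(df)), 0.5, method="nearest"))
print(q_idx)  # 2

# Sort rows lexicographically (column 'a' has priority), take the q_idx-th row.
row = df.sort_values(list(df.columns)).iloc[q_idx]
print(row)  # a=3, b=10, same as method='table'
```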
This means that the result is dependent on the order of the columns:
print(df.quantile(method='table', interpolation='nearest'))
a 3
b 10
Name: 0.5, dtype: int64
# now let's give b priority over a
print(df.iloc[:, ::-1].quantile(method='table', interpolation='nearest'))
b 100
a 1
Name: 0.5, dtype: int64
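As a corollary, a single-column DataFrame goes through the nargsort branch, and (for these interpolation methods, as far as I can tell) 'table' and 'single' agree, because sorting one column and taking the quantile position is the same as taking the quantile of its values:

```python
import pandas as pd

df1 = pd.DataFrame({"b": [100, 100, 10, 1]})

t = df1.quantile(method="table", interpolation="nearest")
s = df1.quantile(method="single", interpolation="nearest")
print(t["b"], s["b"])  # 100 100
```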