Can someone help me understand the difference between the "single" and "table" methods in pandas.DataFrame.quantile? The documentation says only: "Whether to compute quantiles per-column ('single') or over all columns ('table')." https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html
For example, the following code yields the same result for both methods.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=["a", "b"])
print(df.quantile(method="single", interpolation="nearest"))
print(df.quantile(method="table", interpolation="nearest"))
a 3
b 100
Name: 0.5, dtype: int64
a 3
b 100
Name: 0.5, dtype: int64
If you invert the order of the values in b, the results differ:
df['b'] = df['b'].values[::-1]
print(df.quantile(method='single', interpolation='nearest'))
print(df.quantile(method='table', interpolation='nearest'))
Output:
a 3
b 100
Name: 0.5, dtype: int64
a 3
b 10
Name: 0.5, dtype: int64
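A way to see the practical difference (my own sketch, not from the docs): with method='table' the result is always one actual row of the DataFrame, so the values stay consistent across columns, while method='single' may combine values from different rows:

```python
import pandas as pd

# Same data as above, with column b inverted.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [100, 100, 10, 1]})

single = df.quantile(method="single", interpolation="nearest")
table = df.quantile(method="table", interpolation="nearest")

# The 'table' result (a=3, b=10) matches row 2 of df exactly...
print((df == table).all(axis=1).any())   # True
# ...while the 'single' result (a=3, b=100) is not any row of df.
print((df == single).all(axis=1).any())  # False
```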
The exact difference in behavior is a bit tricky to understand. This happens in pandas/core/frame.py:
if method == "single":
    res = data._mgr.quantile(qs=q, interpolation=interpolation)
elif method == "table":
    valid_interpolation = {"nearest", "lower", "higher"}
    if interpolation not in valid_interpolation:
        raise ValueError(
            f"Invalid interpolation: {interpolation}. "
            f"Interpolation must be in {valid_interpolation}"
        )
    # handle degenerate case
    if len(data) == 0:
        if data.ndim == 2:
            dtype = find_common_type(list(self.dtypes))
        else:
            dtype = self.dtype
        return self._constructor([], index=q, columns=data.columns, dtype=dtype)

    q_idx = np.quantile(np.arange(len(data)), q, method=interpolation)

    by = data.columns
    if len(by) > 1:
        keys = [data._get_label_or_level_values(x) for x in by]
        indexer = lexsort_indexer(keys)
    else:
        k = data._get_label_or_level_values(by[0])
        indexer = nargsort(k)

    res = data._mgr.take(indexer[q_idx], verify=False)
    res.axes[1] = q

result = self._constructor_from_mgr(res, axes=res.axes)
return result.__finalize__(self, method="quantile")
In short, method='single' computes the quantile of each column independently:

np.percentile(df.to_numpy(), 50, axis=0, method='nearest')
# array([  3, 100])

With method='table', the quantile is instead computed on the row positions (np.arange(len(df))). The result q_idx is an integer between 0 and len(df) - 1. Then, the rows are sorted with lexsort_indexer, giving priority to the first column(s), and finally the (q_idx)-th row is taken:
data = df
q = 0.5
interpolation = 'nearest'
q_idx = np.quantile(np.arange(len(data)), q, method=interpolation)
# 2
indexer = lexsort_indexer([data._get_label_or_level_values(x) for x in data.columns])
# array([0, 1, 2, 3])
df.iloc[indexer]
# a b
# 0 1 100
# 1 2 100
# 2 3 10 # this row will be picked
# 3 4 1
data.iloc[indexer[q_idx]]
# a 3
# b 10
# Name: 2, dtype: int64
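The same pick can be reproduced with plain pandas (a sketch of the mechanism, not pandas' actual code path): sort by all columns, leftmost first, and take row q_idx:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [100, 100, 10, 1]})

# Quantile of the row positions 0..len(df)-1.
q_idx = int(np.quantile(np.arange(len(df)), 0.5, method="nearest"))
print(q_idx)  # 2

# Sort rows lexicographically (column 'a' has priority), take the q_idx-th row.
row = df.sort_values(list(df.columns)).iloc[q_idx]
print(row)  # a=3, b=10, same as method='table'
```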
This means that the result is dependent on the order of the columns:
print(df.quantile(method='table', interpolation='nearest'))
a 3
b 10
Name: 0.5, dtype: int64
# now let's give b priority over a
print(df.iloc[:, ::-1].quantile(method='table', interpolation='nearest'))
b 100
a 1
Name: 0.5, dtype: int64
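As a corollary, a single-column DataFrame goes through the nargsort branch, and (for these interpolation methods, as far as I can tell) 'table' and 'single' agree, because sorting one column and taking the quantile position is the same as taking the quantile of its values:

```python
import pandas as pd

df1 = pd.DataFrame({"b": [100, 100, 10, 1]})

t = df1.quantile(method="table", interpolation="nearest")
s = df1.quantile(method="single", interpolation="nearest")
print(t["b"], s["b"])  # 100 100
```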