Tags: python, pandas, dataframe, numpy

Difference between "single" and "table" methods in pandas.DataFrame.quantile


Can someone help me understand the difference between the "single" and "table" methods in pandas.DataFrame.quantile?

The documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html) only says: "Whether to compute quantiles per-column (‘single’) or over all columns (‘table’)."

For example, the following code yields the same results.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=["a", "b"])

print(df.quantile(method="single", interpolation="nearest"))
print(df.quantile(method="table", interpolation="nearest"))

Output (identical for both calls):

a      3
b    100
Name: 0.5, dtype: int64
a      3
b    100
Name: 0.5, dtype: int64

Solution

  • If you reverse the order of the values in b, the result will be different:

    df['b'] = df['b'].values[::-1]
    
    print(df.quantile(method='single', interpolation='nearest'))
    print(df.quantile(method='table', interpolation='nearest'))
    

    Output:

    a      3
    b    100
    Name: 0.5, dtype: int64
    
    a     3
    b    10
    Name: 0.5, dtype: int64
    

    The exact difference in behavior is a bit tricky to understand.

    This happens in pandas/core/frame.py, inside DataFrame.quantile:

            if method == "single":
                res = data._mgr.quantile(qs=q, interpolation=interpolation)
            elif method == "table":
                valid_interpolation = {"nearest", "lower", "higher"}
                if interpolation not in valid_interpolation:
                    raise ValueError(
                        f"Invalid interpolation: {interpolation}. "
                        f"Interpolation must be in {valid_interpolation}"
                    )
                # handle degenerate case
                if len(data) == 0:
                    if data.ndim == 2:
                        dtype = find_common_type(list(self.dtypes))
                    else:
                        dtype = self.dtype
                    return self._constructor([], index=q, columns=data.columns, dtype=dtype)
    
                q_idx = np.quantile(np.arange(len(data)), q, method=interpolation)
    
                by = data.columns
                if len(by) > 1:
                    keys = [data._get_label_or_level_values(x) for x in by]
                    indexer = lexsort_indexer(keys)
                else:
                    k = data._get_label_or_level_values(by[0])
                    indexer = nargsort(k)
    
                res = data._mgr.take(indexer[q_idx], verify=False)
                res.axes[1] = q
    
            result = self._constructor_from_mgr(res, axes=res.axes)
            return result.__finalize__(self, method="quantile")
    

    In short, if you use method='single', this computes:

    np.percentile(df.to_numpy(), 50, axis=0, method='nearest')
    # array([  3, 100])
    
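    Note also what this implies: with method='single' each column is reduced independently, so the result need not coincide with any actual row of df, whereas method='table' always returns one of the existing rows. A quick check (a small sketch, using the df with reversed b from above):

    median_single = df.quantile(method='single', interpolation='nearest')
    median_table = df.quantile(method='table', interpolation='nearest')

    # compare every row of df against each result (aligned on the column labels)
    print((df == median_single).all(axis=1).any())  # False: (3, 100) is not a row of df
    print((df == median_table).all(axis=1).any())   # True: (3, 10) is row 2 of df
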

    With method='table', this computes the quantile on the indexer (np.arange(len(df))). The result (q_idx) will be an integer between 0 and len(df) - 1. Then, the rows are sorted with lexsort_indexer, giving priority to the first column(s), and finally the (q_idx)-th row of that order is taken:

    data = df
    q = 0.5
    interpolation = 'nearest'
    
    q_idx = np.quantile(np.arange(len(data)), q, method=interpolation)
    # 2
    
    # lexsort_indexer is the internal helper used above (pandas.core.sorting)
    from pandas.core.sorting import lexsort_indexer

    indexer = lexsort_indexer([data._get_label_or_level_values(x) for x in data.columns])
    # array([0, 1, 2, 3])  (column a is already sorted, so the row order is unchanged)
    
    df.iloc[indexer]
    #    a    b
    # 0  1  100
    # 1  2  100
    # 2  3   10   # this row will be picked
    # 3  4    1
    
    data.iloc[indexer[q_idx]]
    # a     3
    # b    10
    # Name: 2, dtype: int64
    
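    Putting it together, the behavior of method='table' can be reproduced with public API only. This is a rough sketch under the assumptions above (a single q and one of the discrete interpolations), not pandas' actual implementation:

    def table_quantile(df, q=0.5, interpolation='nearest'):
        # position of the quantile row among the sorted row positions
        q_idx = int(np.quantile(np.arange(len(df)), q, method=interpolation))
        # sort rows lexicographically (first column has the highest priority)
        row = df.sort_values(list(df.columns)).iloc[q_idx]
        row.name = q
        return row

    print(table_quantile(df))
    # a     3
    # b    10
    # Name: 0.5, dtype: int64
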

    This means that the result is dependent on the order of the columns:

    print(df.quantile(method='table', interpolation='nearest'))

    Output:

    a     3
    b    10
    Name: 0.5, dtype: int64

    # now let's give b priority over a by reversing the column order
    print(df.iloc[:, ::-1].quantile(method='table', interpolation='nearest'))

    Output:

    b    100
    a      1
    Name: 0.5, dtype: int64
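
    Finally, note the restriction enforced at the top of the table branch in the source above: method='table' only accepts the discrete interpolations 'nearest', 'lower' and 'higher', so the default interpolation ('linear') raises:

    try:
        df.quantile(method='table', interpolation='linear')
    except ValueError as e:
        print(e)
    # Invalid interpolation: linear. Interpolation must be in {'nearest', 'lower', 'higher'}
    # (the ordering of the set in the message may vary between runs)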