
How to keep major-order when copying or groupby-ing a pandas DataFrame?


How can I use or monkey-patch pandas so that copies and groupby aggregations always keep the major order of the original object?

I use pandas.DataFrame as the data structure within a business application (risk model) and need fast aggregation of multidimensional data. Aggregation performance in pandas depends crucially on the major order (memory layout) of the underlying numpy array.

Unfortunately, pandas (version 0.23.4) changes the major order of the underlying numpy array when I create a copy or when I perform an aggregation with groupby and sum.

The impact is:

case 1: 17.2 seconds

case 2: 5 min 46 s

on a DataFrame and its copy with 45023 rows and 100000 columns. Aggregation was performed on the index. The index is a pd.MultiIndex with 15 levels. Aggregation keeps three levels and leads to about 239 groups.

I typically work on DataFrames with 45000 rows and 100000 columns. The rows are indexed by a pandas.MultiIndex with about 15 levels. To compute statistics on various hierarchy nodes I need to aggregate (sum) along the index dimension.
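To make the kind of aggregation concrete, here is a minimal sketch on a tiny stand-in DataFrame. The level names and values are made up for illustration; the real data has about 15 levels and far more rows and columns.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real data: 4 rows, 2 columns, 3 index levels.
# The level names are hypothetical and only serve the illustration.
index = pd.MultiIndex.from_tuples(
    [('A', 'U', 'x'), ('A', 'U', 'y'), ('A', 'V', 'x'), ('B', 'W', 'y')],
    names=['dim_one', 'dim_two', 'dim_three'])
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=index)

# Keep two of the three levels and sum over the remaining one.
aggregated = frame.groupby(level=['dim_one', 'dim_two']).sum()
print(aggregated)
```

Each remaining row is one hierarchy node; the sum runs over all original rows that share the kept level values.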

Aggregation is fast if the underlying numpy array is c_contiguous, i.e. held in row-major order (C order). It is very slow if the array is f_contiguous, i.e. held in column-major order (F order).
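The two layouts can be inspected directly in numpy. The following is a small sketch with illustrative sizes; it only shows the flags and strides and does not measure anything.

```python
import numpy as np

# Small stand-in array; the shape is illustrative only.
c_array = np.arange(12, dtype=np.float64).reshape(3, 4)  # C order: each row is contiguous
f_array = np.asfortranarray(c_array)                      # F order: each column is contiguous

print(c_array.flags.c_contiguous, c_array.flags.f_contiguous)  # True False
print(f_array.flags.c_contiguous, f_array.flags.f_contiguous)  # False True

# Same logical content, different strides (bytes to step along each axis).
print(c_array.strides)  # (32, 8)
print(f_array.strides)  # (8, 24)
print(np.array_equal(c_array, f_array))  # True
```

In C order, stepping along a row touches neighbouring memory; in F order, stepping along a column does.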

Unfortunately, pandas changes the major order from C to F when I create a copy of a DataFrame or aggregate it with groupby and sum.

Sure, I could switch to another 'data model' by keeping the MultiIndex on the columns. Then the current pandas version would always work in my favor. But this is a no-go. I think it is reasonable to expect that the two operations under consideration (groupby-sum and copy) do not change the major order.

import numpy as np
import pandas as pd

print("pandas version: ", pd.__version__)

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array.flags
print("Numpy array is C-contiguous: ", array.flags.c_contiguous)

dataframe = pd.DataFrame(array, index = pd.MultiIndex.from_tuples([('A', 'U'), ('A', 'V'), ('B', 'W')], names=['dim_one', 'dim_two']))
print("DataFrame is C-contiguous: ", dataframe.values.flags.c_contiguous)

dataframe_copy = dataframe.copy()
print("Copy of DataFrame is C-contiguous: ", dataframe_copy.values.flags.c_contiguous)

aggregated_dataframe = dataframe.groupby('dim_one').sum()
print("Aggregated DataFrame is C-contiguous: ", aggregated_dataframe.values.flags.c_contiguous)


## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  False
# Aggregated DataFrame is C-contiguous:  False

The major order of the data should be preserved. If pandas prefers to switch to an implicit default, it should at least allow the user to override it. Numpy accepts the desired order as an argument when creating a copy.
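For reference, this is how numpy exposes the order on copy. The sketch uses a small F-contiguous input so that the difference between the options is visible.

```python
import numpy as np

source = np.asfortranarray(np.arange(6).reshape(2, 3))  # F-contiguous input

keep = source.copy(order='K')   # keep the layout of the input
as_c = source.copy(order='C')   # force C order
as_f = source.copy(order='F')   # force F order
default = source.copy()         # ndarray.copy defaults to order='C'
via_np = np.copy(source)        # np.copy defaults to order='K'

for name, arr in [('keep', keep), ('as_c', as_c), ('as_f', as_f),
                  ('default', default), ('via_np', via_np)]:
    print(name, arr.flags.c_contiguous, arr.flags.f_contiguous)
```

Note that the method `ndarray.copy` and the function `np.copy` have different defaults, so the layout of a copy depends on which one is called.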

A patched version of pandas should result in

## Output in Jupyter Notebook
# pandas version:  0.23.4
# Numpy array is C-contiguous:  True
# DataFrame is C-contiguous:  True
# Copy of DataFrame is C-contiguous:  True
# Aggregated DataFrame is C-contiguous:  True

for the example code snippet above.


Solution

  • Monkey Patch for Pandas (0.23.4 and maybe other versions too)

    I created a patch which I would like to share with you. It results in the performance increase mentioned in the question above.

    It works for pandas version 0.23.4. For other versions you need to check whether it still works.

    The following two modules are needed; you might have to adapt the imports depending on where you put them.

    memory_layout.py   
    memory.py
    

    To patch your code, import the following at the very beginning of your program or notebook and set the memory layout parameter. This monkey-patches pandas and makes sure that copies of DataFrames have the requested layout.

    from memory_layout import memory_layout
    # memory_layout.order = 'F'  # assert F-order on copy
    # memory_layout.order = 'K'  # Keep given layout on copy 
    memory_layout.order = 'C'  # assert C-order on copy
    

    memory_layout.py

    Create file memory_layout.py with the following content.

    import numpy as np
    from pandas.core.internals import Block
    from memory import memory_layout
    
    # memory_layout.order = 'F'  # set memory layout order to 'F' for np.ndarrays in DataFrame copies (Fortran/column-major order)
    # memory_layout.order = 'K'  # keep memory layout order for np.ndarrays in DataFrame copies (order out is order in)
    memory_layout.order = 'C'  # set memory layout order to 'C' for np.ndarrays in DataFrame copies (C/row-major order)
    
    
    def copy(self, deep=True, mgr=None):
        """
        Copy patch on Blocks to set or keep the memory layout
        on copies.
    
        :param self: `pandas.core.internals.Block`
        :param deep: `bool`
        :param mgr: `BlockManager`
        :return: copy of `pandas.core.internals.Block`
        """
        values = self.values
        if deep:
            if isinstance(values, np.ndarray):
                values = memory_layout.copy_transposed(values)
            else:
                values = values.copy()
        return self.make_block_same_class(values)
    
    
    Block.copy = copy  # Block for pandas 0.23.4: in pandas.core.internals.Block
    
    

    memory.py

    Create file memory.py with the following content.

    """
    Implements MemoryLayout copy factory to change memory layout
    of `numpy.ndarrays`.
    Depending on the use case, operations on DataFrames can be much
    faster if the appropriate memory layout is set and preserved.
    
    The implementation allows for changing the desired layout. Changes apply when
    copies or new objects are created, as for example, when slicing or aggregating
    via groupby ...
    
    This implementation tries to solve the issue raised on GitHub
    https://github.com/pandas-dev/pandas/issues/26502
    
    """
    import numpy as np
    
    _DEFAULT_MEMORY_LAYOUT = 'K'
    
    
    class MemoryLayout(object):
        """
        Memory layout management for numpy.ndarrays.
    
        Singleton implementation.
    
        Example:
        >>> from memory import memory_layout
        >>> memory_layout.order = 'K'
        >>> # K ... keep array layout from input
        >>> # C ... set to c-contiguous / row-major order
        >>> # F ... set to f-contiguous / column-major order
        >>> array = memory_layout.apply(array)
        >>> array = memory_layout.apply(array, 'C')
        >>> array = memory_layout.copy(array)
        >>> array = memory_layout.copy_transposed(array)
    
        """
    
        _order = _DEFAULT_MEMORY_LAYOUT
        _instance = None
    
        @property
        def order(self):
            """
            Return memory layout ordering.
    
            :return: `str`
            """
            if self.__class__._order is None:
                raise AssertionError("Array layout order not set.")
            return self.__class__._order
    
        @order.setter
        def order(self, order):
            """
            Set memory layout order.
            Allowed values are 'C', 'F', and 'K'. Raises AssertionError
            when trying to set other values.
    
            :param order: `str`
            :return: `None`
            """
            assert order in ['C', 'F', 'K'], "Only 'C', 'F' and 'K' supported."
            self.__class__._order = order
    
        def __new__(cls):
            """
            Create only one instance throughout the lifetime of this process.
    
            :return: `MemoryLayout` instance as singleton
            """
            if cls._instance is None:
                cls._instance = super(MemoryLayout, cls).__new__(MemoryLayout)
            return cls._instance
    
        @staticmethod
        def get_from(array):
            """
            Get memory layout from array
    
            Possible values:
               'C' ... only C-contiguous (row-major order)
               'F' ... only F-contiguous (column-major order)
               'O' ... other: either both C- and F-contiguous
               (e.g. 1-D or empty arrays) or neither of the two.
    
            :param array: `numpy.ndarray`
            :return: `str`
            """
            if array.flags.c_contiguous == array.flags.f_contiguous:
                return 'O'
            return {True: 'C', False: 'F'}[array.flags.c_contiguous]
    
        def apply(self, array, order=None):
            """
            Apply the order set or the order given as input on the array
            given as input.
    
            Possible values:
               'C' ... apply C-contiguous layout (row-major order)
               'F' ... apply F-contiguous layout (column-major order)
               'K' ... keep the given layout
    
            :param array: `numpy.ndarray`
            :param order: `str`
            :return: `np.ndarray`
            """
            order = self.__class__._order if order is None else order
    
            if order == 'K':
                return array
    
            array_order = MemoryLayout.get_from(array)
            if array_order == order:
                return array
    
            # np.asarray changes the memory layout without moving elements
            # to different logical positions.
            return np.asarray(array, order=order)
    
        def copy(self, array, order=None):
            """
            Return a copy of the input array with the memory layout set.
            Layout set:
               'C' ... return C-contiguous copy
               'F' ... return F-contiguous copy
               'K' ... return copy with same layout as
               given by the input array.
    
            :param array: `np.ndarray`
            :param order: `str` or `None` (use the order set on the singleton)
            :return: `np.ndarray`
            """
            order = order if order is not None else self.__class__._order
            # ndarray.copy understands 'K' and keeps the input layout.
            return array.copy(order=order)
    
        def copy_transposed(self, array):
            """
            Return a copy of the input array in order that its transpose
            has the memory layout set.
    
            Note: transposing a numpy array only swaps its strides, so a
            C-contiguous (row-major) array becomes F-contiguous (column-major)
            and vice versa; the data is not reshuffled in memory.
    
            Layout set:
               'C' ... return F-contiguous copy
               'F' ... return C-contiguous copy
               'K' ... return copy with the same layout as the input
               array, so that its transpose keeps its layout as well.
    
            :param array: `np.ndarray`
            :return: `np.ndarray`
            """
            if self.__class__._order == 'K':
                return array.copy(
                    order={'C': 'C', 'F': 'F', 'O': None}[self.get_from(array)])
            else:
                return array.copy(
                    order={'C': 'F', 'F': 'C'}[self.__class__._order])
    
        def __str__(self):
            return str(self.__class__._order)
    
    
    memory_layout = MemoryLayout()  # Singleton
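The reason for copying the block values in the opposite order can be illustrated with plain numpy. This sketch assumes that pandas stores a block's values transposed relative to what `DataFrame.values` returns; the variable names are made up for the illustration and no pandas internals are touched.

```python
import numpy as np

# Stand-in for what DataFrame.values shows: a C-contiguous 2-D array.
frame_values = np.arange(6).reshape(2, 3)
# Stand-in for the block's internal values: a transposed view of the same data.
block_values = frame_values.T
print(block_values.flags.f_contiguous)   # True

# Copying the block in F order keeps the transposed-back result C-contiguous.
block_copy = block_values.copy(order='F')
print(block_copy.T.flags.c_contiguous)   # True

# A plain copy uses C order, so the transposed-back result is F-contiguous.
plain_copy = block_values.copy()
print(plain_copy.T.flags.c_contiguous)   # False
```

Under that assumption, the plain copy would explain why the copied DataFrame in the question reports `False` for `c_contiguous`, and why `copy_transposed` requests 'F' when the desired DataFrame layout is 'C'.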