pythonpandasgroup-bypivotpivot-table

How can I pivot a dataframe?


How do I pivot the pandas dataframe df defined at bottom such that the col values become columns, row values become the index, and mean of val0 becomes the values? (in some cases this is called transforming from long-format to wide-format)

(See note at bottom: Why is this question not a duplicate? and why this is thematically one question and not too broad.)

Subquestions

  1. (How to avoid getting ValueError: Index contains duplicate entries, cannot reshape?)

  2. How do I pivot df defined at bottom, such that the col values become columns, row values become the index, and mean of val0 are the values?

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24
    

How do I pivot...

  1. ... so that missing values are 0?

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24
    
  2. ... to do an aggregate function other than mean, like sum?

    col   col0  col1  col2  col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
    
  3. ... to do more that one aggregation at a time?

           sum                          mean
    col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
    
  4. ... to aggregate over multiple 'value' columns?

          val0                             val1
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  5. ... to subdivide by multiple columns? (item0,item1,item2..., col0,col1,col2...)

    item item0             item1                         item2
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  6. ... to subdivide by multiple rows: (key0,key1... row0,row1,row2...)

    item      item0             item1                         item2
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  7. ... to aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

    col   col0  col1  col2  col3  col4
    row
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    
  8. ... to convert a DataFrame from long-to-wide by pivoting on ONLY two columns? Given:

    np.random.seed([3, 1415])
    df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
    df2
       A   B
    0  a   0
    1  a  11
    2  a   2
    3  a  11
    4  b  10
    5  b  10
    6  b  14
    7  c   7
    

    The expected should look something like

          a     b    c
    0   0.0  10.0  7.0
    1  11.0  10.0  NaN
    2   2.0  14.0  NaN
    3  11.0   NaN  NaN
    
  9. ... to flatten the multiple index to a single multi-index after pivot?

    From:

       1  2
       1  1  2
    a  2  1  1
    b  2  1  0
    c  1  0  0
    

    To:

       1|1  2|1  2|2
    a    2    1    1
    b    2    1    0
    c    1    0    0
    

Setup

Consider a dataframe df with columns 'key', 'row', 'item', 'col', and random float values 'val0', 'val1'. I conspicuously named the columns and relevant column values to correspond with how I want to pivot them.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Why is this question not a duplicate? and more useful than the following autosuggestions:

  1. How to pivot a dataframe in Pandas? only covers the specific case of 'Country' to row-index, values of 'Indicator' for 'Year' to multiple columns and no aggregation of values.

  2. pandas pivot table to data frame asks how to pivot in pandas like in R, i.e. autogenerate an individual column for each value of strength...

  3. pandas pivoting a dataframe, duplicate rows asks about the syntax for pivoting multiple columns, without needing to list them all.

None of the existing questions and answers are comprehensive, so this is an attempt at a canonical question and answer that encompasses all aspects of pivoting.


Solution

  • Here is a list of idioms we can use to pivot

    1. pd.DataFrame.pivot_table

      • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And it is the intended approach by the developers.
      • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
    2. pd.DataFrame.groupby + pd.DataFrame.unstack

      • Good general approach for doing just about any type of pivot
      • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
    3. pd.DataFrame.set_index + pd.DataFrame.unstack

      • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
      • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
    4. pd.DataFrame.pivot

      • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
      • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
    5. pd.crosstab

      • This a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
    6. pd.factorize + np.bincount

      • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
    7. pd.get_dummies + pd.DataFrame.dot

      • I use this for cleverly performing cross tabulation.

    See also:


    Question 1

    Why do I get ValueError: Index contains duplicate entries, cannot reshape

    This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys on which it is being asked to pivot. For example: Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

    df.duplicated(['row', 'col']).any()
    
    True
    

    So when I pivot using

    df.pivot(index='row', columns='col', values='val0')
    

    I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

    df.set_index(['row', 'col'])['val0'].unstack()
    

    Examples

    What I'm going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

    Questions 2 and 3

    How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

    How do I make it so that missing values are 0?


    Question 4

    Can I get something other than mean, like maybe sum?


    Question 5

    Can I do more that one aggregation at a time?

    Notice that for pivot_table and crosstab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.


    Question 6

    Can I aggregate over multiple value columns?


    Question 7

    Can I subdivide by multiple columns?


    Question 8

    Can I subdivide by multiple columns?


    Question 9

    Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?


    Question 10

    How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?


    Question 11

    How do I flatten the multiple index to single index after pivot

    If columns type object with string join

    df.columns = df.columns.map('|'.join)
    

    else format

    df.columns = df.columns.map('{0[0]}|{0[1]}'.format)