pythonpandaspivot

Finding offending entries when pivot fails with a ValueError: Index contains duplicate entries, cannot reshape


I have a dataframe with 3 columns and datatypes: Datetime (Datetime dtype), Area (string), Value (float).

I want to pivot it so I get separate columns for each unique entry in Area.

df_pivot = pd.pivot(df,index='Datetime',columns='Area','values='Value')

Now I get the following error

ValueError: Index contains duplicate entries, cannot reshape.

I drop duplicates without subsetting and try again but get the same result.

I try to find the offending entries with no success:

duplicates = df[df.duplicated(subset=['Datetime','Area','Value'], keep=False)]

This returns an empty dataframe

So I try resetting the index and repeating subsetting with the nex index:

duplicates = df[df.duplicated(subset=['Datetime','Area','Value','index'], keep=False)]

I get the same result, an empty dataframe and an unpivottable df.

Is there some other way to find the offending entries?


Solution

  • You need find duplicates per Datetime and Area, not for all 3 columns, because new index and new column is created by Datetime and Area columns:

    df = pd.DataFrame(
        {
        "Datetime" : [1,1,2,3,3],
        "Value" : [1,2,3,4,5],
        "Area" : ["D", "D", "D", "E", "E"]
        }
     )
    

    Get duplicates per both columns:

    duplicates = df[df.duplicated(subset=['Datetime','Area'], keep=False)]
    print (duplicates)
       Datetime  Value Area
    0         1      1    D
    1         1      2    D
    3         3      4    E
    4         3      5    E
    

    Rmeove duplicates per both columns, keep first values is possible by DataFrame.drop_duplicates:

    print (df.drop_duplicates(subset=['Datetime','Area']))
       Datetime  Value Area
    0         1      1    D
    2         2      3    D
    3         3      4    E
    

    All toghether:

    df_pivot = (df.drop_duplicates(subset=['Datetime','Area'])
                  .pivot(index='Datetime',columns='Area',values='Value'))
    print (df_pivot)
    Area        D    E
    Datetime          
    1         1.0  NaN
    2         3.0  NaN
    3         NaN  4.0
    

    Another idea is aggregate duplicated values, e.g. by sum:

    df_pivot = df.pivot_table(index='Datetime',columns='Area',values='Value', aggfunc='sum')
    print (df_pivot)
    Area        D    E
    Datetime          
    1         3.0  NaN
    2         3.0  NaN
    3         NaN  9.0