pythonpandasdataframexls

Sort dataframe by dates


I am importing an excel file in which a column is a date in the format dd/mm/yyyy.

When I import it from the excel file, I think it is understood as a string. I need to sort the whole DataFrame by date, so I perform this code: import pandas as pd

import pandas as pd

dictionary = {
      "DATE" : ['12/02/2023', '02/01/2023', '02/01/2023', '10/02/2023'],    
      "CONCEPT" : ['Supermarket','Restaurant', 'Gas', 'Suscription'],
      "EUR" : [-150,-50,-45,-95]
          }
df = pd.DataFrame(dictionary)

df['DATE'] =  pd.to_datetime(df['DATE']).dt.strftime('%d/%m/%Y')

df = df.sort_values(by=['DATE'],axis=0, ascending=True)

If you perform this example, you will see it works perfectly fine, as the first-row date, 12/02/2023, is sorted in the last position. However, when I am using my real excel file, this date is interpreted as the 2nd of December 2023. Moreover, it sorts the date column as strings as not as dates; therefore, 31/01/2023 goes after 28/02/2023.

How could I solve this problem?


Solution

  • You need to use the dayfirst argument to to_datetime to make the conversion work as expected. Using your sample data:

    pd.to_datetime(df['DATE'])
    # 0   2023-12-02
    # 1   2023-02-01
    # 2   2023-02-01
    # 3   2023-10-02
    pd.to_datetime(df['DATE'], dayfirst=True)
    # 0   2023-02-12
    # 1   2023-01-02
    # 2   2023-01-02
    # 3   2023-02-10
    

    Your DATE column is then being sorted as strings because you have converted it to a string using strftime. You should sort first and then convert to a string just for display purposes.

    df['DATE'] =  pd.to_datetime(df['DATE'], dayfirst=True)
    df = df.sort_values(by=['DATE'],axis=0, ascending=True)
    df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
    

    Output

             DATE      CONCEPT  EUR
    1  02/01/2023   Restaurant  -50
    2  02/01/2023          Gas  -45
    3  10/02/2023  Suscription  -95
    0  12/02/2023  Supermarket -150
    

    If you need to reset the index, just use reset_index:

    df = df.reset_index(drop=True)
    

    Output:

             DATE      CONCEPT  EUR
    0  02/01/2023   Restaurant  -50
    1  02/01/2023          Gas  -45
    2  10/02/2023  Suscription  -95
    3  12/02/2023  Supermarket -150