
FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. How to combine date and time columns with pd.to_datetime?


Here is an example of my .csv file:

date, time, value
20240112,085917,11
20240112,085917,22

I used to import it into a DataFrame like this:

df = pd.read_csv(csv_file, parse_dates=[['date', 'time']]).set_index('date_time')

And I was getting the following structure:

date_time             value
2024-01-12 08:59:17   11
2024-01-12 08:59:17   22

After updating to pandas 2.2.0, I started getting this warning:

FutureWarning: Support for nested sequences for 'parse_dates' in pd.read_csv is deprecated. Combine the desired columns with pd.to_datetime after parsing instead.

So in order to achieve the same result now I have to do:

df['datetime'] = df.date.astype(str) + ' ' + df.time.astype(str)
df['datetime'] = pd.to_datetime(df.datetime, format="%Y%m%d %H%M%S")
df = df.drop(['date', 'time'], axis=1).set_index('datetime')

Is there a way to do this in newer versions of pandas without string concatenation, which is usually very slow?


Solution

  • Parsing the date involves strings anyway, and your time format has no separators, so concatenating the two columns and calling pd.to_datetime seems like the most reasonable option.

    You could simplify your code by reading the columns as strings directly and popping them while concatenating:

    df = pd.read_csv(csv_file, sep=', *', engine='python',
                     dtype={'date': str, 'time': str})
    
    df['datetime'] = pd.to_datetime(df.pop('date')+' '+df.pop('time'),
                                    format="%Y%m%d %H%M%S")
    df = df.set_index('datetime')
    

    NB. If your months, days, hours, minutes, and seconds are reliably zero-padded, you can skip the separator: use df.pop('date')+df.pop('time') with format="%Y%m%d%H%M%S".

    Output:

                         value
    datetime                  
    2024-01-12 08:59:17     11
    2024-01-12 08:59:17     22
    

    A variant with numeric operations and a timedelta:

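    import numpy as np

    # 'time' is left to be parsed as an integer (085917 -> 85917)
    df = pd.read_csv(csv_file, sep=', *', engine='python',
                     dtype={'date': str})
    
    # split HHMMSS into hours, minutes, seconds
    a = df.pop('time').to_numpy()
    a, s = np.divmod(a, 100)
    h, m = np.divmod(a, 100)
    
    df['datetime'] = (pd.to_datetime(df.pop('date'), format="%Y%m%d")
                     +pd.to_timedelta(h*3600+m*60+s, unit='s')
                     )
    df = df.set_index('datetime')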
    

    This is actually much slower: 27.7 ms ± 4.11 ms per loop, versus 350 µs ± 44.5 µs per loop for the string approach.
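
To check that the two variants agree and to time them on your own machine, a minimal sketch along these lines could be used. The helper names (read_raw, combine_string, combine_numeric) and the inline CSV_TEXT sample are made up for illustration. It assumes zero-padded fields, so it uses the separator-free concatenation from the NB, and it times only the conversion step, not read_csv.

```python
import io
import timeit

import numpy as np
import pandas as pd

# Sample in the question's layout (the header has spaces after the commas).
CSV_TEXT = "date, time, value\n20240112,085917,11\n20240112,085917,22\n"


def read_raw(csv_text, time_as_str):
    """Read the CSV; 'date' is always str, 'time' is str only if requested."""
    dtype = {"date": str, "time": str} if time_as_str else {"date": str}
    return pd.read_csv(io.StringIO(csv_text), sep=", *", engine="python",
                       dtype=dtype)


def combine_string(raw):
    """String approach: concatenate the zero-padded fields, parse once."""
    df = raw.copy()
    df["datetime"] = pd.to_datetime(df.pop("date") + df.pop("time"),
                                    format="%Y%m%d%H%M%S")
    return df.set_index("datetime")


def combine_numeric(raw):
    """Numeric approach: split the HHMMSS integer with divmod."""
    df = raw.copy()
    hm, s = np.divmod(df.pop("time").to_numpy(), 100)
    h, m = np.divmod(hm, 100)
    df["datetime"] = (pd.to_datetime(df.pop("date"), format="%Y%m%d")
                      + pd.to_timedelta(h * 3600 + m * 60 + s, unit="s"))
    return df.set_index("datetime")


by_string = combine_string(read_raw(CSV_TEXT, time_as_str=True))
by_numeric = combine_numeric(read_raw(CSV_TEXT, time_as_str=False))

if __name__ == "__main__":
    # Larger synthetic input (row count is arbitrary) for a rough timing.
    big = "date, time, value\n" + "20240112,085917,11\n" * 10_000
    cases = [("string", combine_string, read_raw(big, True)),
             ("numeric", combine_numeric, read_raw(big, False))]
    for name, fn, raw in cases:
        best = min(timeit.repeat(lambda: fn(raw), number=10, repeat=5)) / 10
        print(f"{name}: {best * 1e3:.2f} ms per call")
```

Run as a script, this prints a per-call time for each approach. The numbers depend on the pandas version, the row count, and the hardware, so they will not necessarily match the timings quoted above.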