pandasdataframe

ValueError: NaTType does not support strftime


I tried using the dropna method to remove rows with missing "Date" values before resetting the index, then I get a KeyError:

df.set_index('Date', inplace=True)
df = df.between_time(TIME_BINANCE_OPEN, TIME_BINANCE_CLOSE)
df = df.dropna(subset=['Date'])
df.reset_index(inplace=True)
Traceback (most recent call last):
  File "/Users/anon/stocks-prediction-Machine-learning-RealTime-TensorFlow/0_API_alpaca_historical.py", line 53, in <module>
    df.dropna(subset=['Date'], inplace=True)
  File "/Users/anon/stocks-prediction-Machine-learning-RealTime-TensorFlow/MLT/lib/python3.11/site-packages/pandas/core/frame.py", line 6421, in dropna
    raise KeyError(np.array(subset)[check].tolist())
KeyError: ['Date']

The current code:

for symbol in stocks_list:
    print("Starting data fetching process Stock: ", symbol)
    df = get_binance_bars(symbol, interval, START_DATE, END_DATE)
    print("Data fetching process completed df.shape: ", df.shape)

    print(df)
    print(df.columns)

    if df is not None:
        df['Date'] = pd.to_datetime(df['Date'])
        TIME_BINANCE_OPEN = "00:01:00"
        TIME_BINANCE_CLOSE = "08:19:00"
        
        # Perform time-based filtering using the "Date" column
        df = df[(df['Date'].dt.time >= pd.to_datetime(TIME_BINANCE_OPEN).time()) &
                (df['Date'].dt.time <= pd.to_datetime(TIME_BINANCE_CLOSE).time())]
        df = df.dropna(subset=['Date'])
        if not df.empty:
            max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
            min_recent_date = df['Date'].min().strftime("%Y-%m-%d %H:%M:%S")
        else:
            min_recent_date = max_recent_date = None
        directory = "d_price/RAW_binance/"
        if not os.path.exists(directory):
            os.makedirs(directory)

        file_path = directory + "binance_" + symbol + '_' + interval + ".csv"
        df.to_csv(file_path, sep="\t", index=None)

        print("\tSTART: ", str(df['Date'].min()),  "  END: ", str(df['Date'].max()), " shape: ", df.shape, "\n")

Traceback:

Starting data fetching process Stock:  DASHUSDT
Data fetching process completed df.shape:  (500, 6)
                 Date    Open    High     Low   Close   Volume
0 2022-01-01 00:00:00  133.62  133.98  133.59  133.93  743.246
1 2022-01-01 00:01:00  133.93  134.79  133.85  134.65  835.018
2 2022-01-01 00:02:00  134.61  134.75  134.42  134.51  421.264
3 2022-01-01 00:03:00  134.56  134.63  134.37  134.49  209.346
4 2022-01-01 00:04:00  134.52  134.68  134.42  134.48  204.820
Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')
Traceback (most recent call last):
  File "/Users/anon/stocks-prediction-Machine-learning-RealTime-TensorFlow/0_API_alpaca_historical.py", line 52, in <module>
    max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "nattype.pyx", line 58, in pandas._libs.tslibs.nattype._make_error_func.f
ValueError: NaTType does not support strftime

I've found out that when you reset the index using df.reset_index(inplace=True), it can result in an empty DataFrame if there are any missing or NaN values in the "Date" column. I made sure there are no missing values in the "Date" column before resetting the index by using "dropna" to remove NaN values before reset_index. However my workaround didn't work.


Solution

  • This issue is that you dropped all rows in df = df.dropna(subset=['Date']). Thus there is no min/max date left to use.

    You probably want to use a conditional here:

    # ...
            df = df.dropna(subset=['Date'])
            if not df.empty:
                max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
                min_recent_date = df['Date'].min().strftime("%Y-%m-%d %H:%M:%S")
            else:
                min_recent_date = max_recent_date = None
    

    Note that using a valid (non-empty) slice works as expected:

    if df is not None:
            df['Date'] = pd.to_datetime(df['Date'])
            TIME_BINANCE_OPEN = "00:01:00"    # this will keep 2 rows
            TIME_BINANCE_CLOSE = "00:02:00"   #
            
            # Perform time-based filtering using the "Date" column
            df = df[(df['Date'].dt.time >= pd.to_datetime(TIME_BINANCE_OPEN).time()) &
                    (df['Date'].dt.time <= pd.to_datetime(TIME_BINANCE_CLOSE).time())]
            df = df.dropna(subset=['Date'])
            if not df.empty:
                max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
                min_recent_date = df['Date'].min().strftime("%Y-%m-%d %H:%M:%S")
            else:
                min_recent_date = max_recent_date = None
                
    print(min_recent_date)
    print(max_recent_date)
    

    Output:

    2022-01-01 00:01:00
    2022-01-01 00:02:00
    

    Alternatively, use a Series operation:

    # ...
            df = df.dropna(subset=['Date'])
            min_recent_date, max_recent_date = df['Date'].agg(['min', 'max']).dt.strftime("%Y-%m-%d %H:%M:%S")