I tried using the dropna
method to remove rows with missing "Date" values before resetting the index, then I get a KeyError:
df.set_index('Date', inplace=True)
df = df.between_time(TIME_BINANCE_OPEN, TIME_BINANCE_CLOSE)
df = df.dropna(subset=['Date'])
df.reset_index(inplace=True)
Traceback (most recent call last):
File "/Users/anon/stocks-prediction-Machine-learning-RealTime-TensorFlow/0_API_alpaca_historical.py", line 53, in <module>
df.dropna(subset=['Date'], inplace=True)
File "/Users/anon/stocks-prediction-Machine-learning-RealTime-TensorFlow/MLT/lib/python3.11/site-packages/pandas/core/frame.py", line 6421, in dropna
raise KeyError(np.array(subset)[check].tolist())
KeyError: ['Date']
The current code:
for symbol in stocks_list:
print("Starting data fetching process Stock: ", symbol)
df = get_binance_bars(symbol, interval, START_DATE, END_DATE)
print("Data fetching process completed df.shape: ", df.shape)
print(df)
print(df.columns)
if df is not None:
df['Date'] = pd.to_datetime(df['Date'])
TIME_BINANCE_OPEN = "00:01:00"
TIME_BINANCE_CLOSE = "08:19:00"
# Perform time-based filtering using the "Date" column
df = df[(df['Date'].dt.time >= pd.to_datetime(TIME_BINANCE_OPEN).time()) &
(df['Date'].dt.time <= pd.to_datetime(TIME_BINANCE_CLOSE).time())]
df = df.dropna(subset=['Date'])
if not df.empty:
max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
min_recent_date = df['Date'].min().strftime("%Y-%m-%d %H:%M:%S")
else:
min_recent_date = max_recent_date = None
directory = "d_price/RAW_binance/"
if not os.path.exists(directory):
os.makedirs(directory)
file_path = directory + "binance_" + symbol + '_' + interval + ".csv"
df.to_csv(file_path, sep="\t", index=None)
print("\tSTART: ", str(df['Date'].min()), " END: ", str(df['Date'].max()), " shape: ", df.shape, "\n")
Traceback:
Starting data fetching process Stock: DASHUSDT
Data fetching process completed df.shape: (500, 6)
Date Open High Low Close Volume
0 2022-01-01 00:00:00 133.62 133.98 133.59 133.93 743.246
1 2022-01-01 00:01:00 133.93 134.79 133.85 134.65 835.018
2 2022-01-01 00:02:00 134.61 134.75 134.42 134.51 421.264
3 2022-01-01 00:03:00 134.56 134.63 134.37 134.49 209.346
4 2022-01-01 00:04:00 134.52 134.68 134.42 134.48 204.820
Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')
Traceback (most recent call last):
File "/Users/anon/stocks-prediction-Machine-learning-RealTime-TensorFlow/0_API_alpaca_historical.py", line 52, in <module>
max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "nattype.pyx", line 58, in pandas._libs.tslibs.nattype._make_error_func.f
ValueError: NaTType does not support strftime
I've found out that when you reset the index using df.reset_index(inplace=True)
, it can result in an empty DataFrame if there are any missing or NaN values in the "Date" column. I made sure there are no missing values in the "Date" column before resetting the index by using "dropna" to remove NaN values before reset_index. However my workaround didn't work.
This issue is that you dropped all rows in df = df.dropna(subset=['Date'])
. Thus there is no min/max date left to use.
You probably want to use a conditional here:
# ...
df = df.dropna(subset=['Date'])
if not df.empty:
max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
min_recent_date = df['Date'].min().strftime("%Y-%m-%d %H:%M:%S")
else:
min_recent_date = max_recent_date = None
Note that using a valid (non-empty) slice works as expected:
if df is not None:
df['Date'] = pd.to_datetime(df['Date'])
TIME_BINANCE_OPEN = "00:01:00" # this will keep 2 rows
TIME_BINANCE_CLOSE = "00:02:00" #
# Perform time-based filtering using the "Date" column
df = df[(df['Date'].dt.time >= pd.to_datetime(TIME_BINANCE_OPEN).time()) &
(df['Date'].dt.time <= pd.to_datetime(TIME_BINANCE_CLOSE).time())]
df = df.dropna(subset=['Date'])
if not df.empty:
max_recent_date = df['Date'].max().strftime("%Y-%m-%d %H:%M:%S")
min_recent_date = df['Date'].min().strftime("%Y-%m-%d %H:%M:%S")
else:
min_recent_date = max_recent_date = None
print(min_recent_date)
print(max_recent_date)
Output:
2022-01-01 00:01:00
2022-01-01 00:02:00
Alternatively, use a Series operation:
# ...
df = df.dropna(subset=['Date'])
min_recent_date, max_recent_date = df['Date'].agg(['min', 'max']).dt.strftime("%Y-%m-%d %H:%M:%S")