Sorry for the noob question. I have a bunch of stocks in a sqlite3 database:
import pandas as pd
import sqlite3, config
connection = sqlite3.connect(config.db_file)
connection.row_factory = sqlite3.Row
df = pd.read_sql('SELECT * FROM stock_price', connection)
# sort the dataframe
df.sort_values(by='stock_id', inplace=True)
# # set the index to be this and don't drop
df.set_index(keys=['stock_id'], drop=False,inplace=True)
When I print the df, it gives me the following (where each stock_id refers to a unique stock, e.g APPL):
id stock_id date open high low close volume
stock_id
1 1 1 2022-08-02 9.83 9.845 9.83 9.830 584772
1 2 1 2022-08-03 9.84 9.860 9.84 9.820 7711
4 3 4 2022-08-03 10.38 10.380 10.38 10.380 199
5 46 5 2022-08-03 34.75 35.200 34.75 35.200 1007
5 45 5 2022-08-02 34.32 34.550 34.32 34.442 1252
... ... ... ... ... ... ... ... ...
98 8 98 2022-08-02 28.00 28.095 27.90 28.000 2417
99 71 99 2022-08-02 88.19 88.940 87.15 88.370 1045596
99 72 99 2022-08-03 88.34 88.550 87.65 88.410 982710
100 171 100 2022-08-02 117.58 120.010 117.08 119.270 67795
100 172 100 2022-08-03 119.80 121.940 120.60 121.440 4237
[178 rows x 8 columns]
I need to target each unique stock_id
individually, and get the prior close.
I know if each stock was in its own separate dataframe, I could do something like this:
final_df['previous close'] = final_df['c'].shift()
But when I've tried that, because everything in one dataframe, then you get one stock getting the previous close of an entirely different stock, which isn't what I want.
So my question:
What's the best to achieve splitting out all these different stocks from one single dataframe and being able to target them individually, and get the previous close price of each stock?
If I understand the question correctly, you just want the close of the previous row within each stock group. You can do that easily:
newdf = df.assign(prev_close=df.groupby(level=0)['close'].shift())
or, before setting the index to stock_id
:
newdf = df.assign(prev_close=df.groupby('stock_id')['close'].shift())
On your data sample:
>>> newdf
id stock_id date open high low close volume prev_close
stock_id
1 1 1 2022-08-02 9.83 9.845 9.83 9.830 584772 NaN
1 2 1 2022-08-03 9.84 9.860 9.84 9.820 7711 9.83
4 3 4 2022-08-03 10.38 10.380 10.38 10.380 199 NaN
5 46 5 2022-08-03 34.75 35.200 34.75 35.200 1007 NaN
5 45 5 2022-08-02 34.32 34.550 34.32 34.442 1252 35.20
98 8 98 2022-08-02 28.00 28.095 27.90 28.000 2417 NaN
99 71 99 2022-08-02 88.19 88.940 87.15 88.370 1045596 NaN
99 72 99 2022-08-03 88.34 88.550 87.65 88.410 982710 88.37
100 171 100 2022-08-02 117.58 120.010 117.08 119.270 67795 NaN
100 172 100 2022-08-03 119.80 121.940 120.60 121.440 4237 119.27
Side note: try to not use inplace=True
all over the place. It will cause you trouble in the long run (see e.g. here).
Personally, I would prefer this format:
cleandf = df.assign(
date=pd.to_datetime(df['date'])
).set_index(['stock_id', 'date']).sort_index()
newdf = cleandf.assign(prev_close=cleandf.groupby('stock_id')['close'].shift())
>>> newdf
id open high low close volume prev_close
stock_id date
1 2022-08-02 1 9.83 9.845 9.83 9.830 584772 NaN
2022-08-03 2 9.84 9.860 9.84 9.820 7711 9.830
4 2022-08-03 3 10.38 10.380 10.38 10.380 199 NaN
5 2022-08-02 45 34.32 34.550 34.32 34.442 1252 NaN
2022-08-03 46 34.75 35.200 34.75 35.200 1007 34.442
98 2022-08-02 8 28.00 28.095 27.90 28.000 2417 NaN
99 2022-08-02 71 88.19 88.940 87.15 88.370 1045596 NaN
2022-08-03 72 88.34 88.550 87.65 88.410 982710 88.370
100 2022-08-02 171 117.58 120.010 117.08 119.270 67795 NaN
2022-08-03 172 119.80 121.940 120.60 121.440 4237 119.270