I'm reading in quite a lot of data from some text files and save it to dataframes to easily be able to access the columns later when processing the data in a later stage. I feel the way I'm currently doing it is very inefficient and would like help with a better way of doing it. The text files are values for different variables at every time step of a simulation. The simulation are grouped by into categories and I want the average in every category. I don't want to take the average of every variable that is outputted but only the ones that I choose.
Some basics of the data. The number of files is known. The number of dataframes I want to average over is known (all files from a category). The columns are always the same in all files and thus the dataframes. The number of and name of categories are known.
Below is some simplified semi-pseudo code of what I'm currently doing. Note that it's more files and columns in the raw data.
sim_nr = np.arange(1,13)
categories = ['a','b','c']
variables = ['S', 'P', 'L']
multi_column = pd.MultiIndex.from_tuples(list(zip(categories, sim_nr)))
all_data = {}
for i in number_of_files:
data = pd.read_csv('data_'+str(i)+'.txt')
all_data[mutlicolumn[i]] = data
#for working example use this dictionary,
all_data = {('a',1): pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[23,5,61,85]}),
('a',2): pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[32,73,30,41]}),
('b',3):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[2,3,4,5], 'M':[1,2,3,4], 'L':[32,73,30,41]}),
('b',4):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[32,73,30,41]}),
('c',5):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[11,20,55,8], 'M':[1,2,3,4], 'L':[23,5,61,85]}),
('c',6):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[32,73,30,41]})}
variableS = pd.DataFrame()
variableP = pd.DataFrame()
variableL = pd.DataFrame()
for cat in categories:
test = pd.DataFrame()
for key, data in all_data.items():
if key[0] == cat:
# print(key[0])
test= pd.concat([test, data[variables]], axis= 1)
variableS[cat] = test['S'].mean(axis=1)
variableP[cat] = test['P'].mean(axis=1)
variableL[cat] = test['L'].mean(axis=1)
I would like help with how I can improve on this. Especially the last loop with how to save the averaged values of the different variables.
You could first concat
your datasets in a single DataFrame, then use groupby.mean
to aggregate the data per index+second level in the tuple, then swap the first index level and columns with stack
/unstack
, finally split the groups with groupby
and a dictionary comprehension:
categories = ['a', 'b', 'c']
variables = ['S', 'P', 'L']
# reshape and aggregate as mean
tmp = (pd.concat(all_data)
.loc[categories, variables]
.groupby(level=[0, 2]).mean()
.stack().unstack(0)
)
# split the variables
out = {k: v.droplevel(1) for k,v in tmp.groupby(level=1, sort=False)}
# or
# out = dict(list(tmp.droplevel(1)
# .groupby(tmp.index.get_level_values(1),
# sort=False)))
Output:
{'L': a b c
0 27.5 32.0 27.5
1 39.0 73.0 39.0
2 45.5 30.0 45.5
3 63.0 41.0 63.0,
'P': a b c
0 78.0 40.0 44.5
1 5.0 4.0 12.5
2 3.0 3.5 29.0
3 6.0 5.5 7.0,
'S': a b c
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0,
}
Intermediates:
# pd.concat(all_data).loc[categories, variables].groupby(level=[0, 2]).mean()
S P L
a 0 1.0 78.0 27.5
1 2.0 5.0 39.0
2 3.0 3.0 45.5
3 4.0 6.0 63.0
b 0 1.0 40.0 32.0
1 2.0 4.0 73.0
2 3.0 3.5 30.0
3 4.0 5.5 41.0
c 0 1.0 44.5 27.5
1 2.0 12.5 39.0
2 3.0 29.0 45.5
3 4.0 7.0 63.0
# .stack().unstack(0)
a b c
0 S 1.0 1.0 1.0
P 78.0 40.0 44.5
L 27.5 32.0 27.5
1 S 2.0 2.0 2.0
P 5.0 4.0 12.5
L 39.0 73.0 39.0
2 S 3.0 3.0 3.0
P 3.0 3.5 29.0
L 45.5 30.0 45.5
3 S 4.0 4.0 4.0
P 6.0 5.5 7.0
L 63.0 41.0 63.0