Suppose I have one CSV file with dimension m×n means m rows and n columns. I want to reduce its dimension by replacing average value of corresponding sub matrix.
Example 1:
Given we have 6×6 matrix (CSV file):
col1,col2,col3,col4,col5,col6
a1,b1,c1,d1,e1, f1
a2,b2,c2,d2,e2, f2
a3,b3,c3,d3,e3, f3
a4,b4,c4,d4,e4, f4
a5,b5,c5,d5,e5, f5
a6,b6,c6,d6,e6, f6
If we want 2×2 matrix, then resultant CSV file should be below:
col1, col2
a', d'
a", d"
Where
a'=(a1+a2+a3+b1+b2+b3+c1+c2+c3)/9
a"=(a4+a5+a6+b4+b5+b6+c4+c5+c6)/9
d'=(d1+d2+d3+e1+e2+e3+f1+f2+f3)/9
d"=(d4+d5+d6+e4+e5+e6+f4+f5+f6)/9
Example:2
Given we have 5×6 matrix (CSV file):
col1,col2,col3,col4,col5,col6
a1,b1,c1,d1,e1, f1
a2,b2,c2,d2,e2, f2
a3,b3,c3,d3,e3, f3
a4,b4,c4,d4,e4, f4
a5,b5,c5,d5,e5, f5
If we want 2×2 matrix, then resultant CSV file should be below:
col1, col2
a', d'
a", d"
Where
a'=(a1+a2+a3+b1+b2+b3+c1+c2+c3)/9
a"=(a4+a5+b4+b5+c4+c5)/6
d'=(d1+d2+d3+e1+e2+e3+f1+f2+f3)/9
d"=(d4+d5+e4+e5+f4+f5)/6
Example 3: Given we have 6×5 matrix (CSV file):
col1,col2,col3,col4,col5,col6
a1,b1,c1,d1,e1
a2,b2,c2,d2,e2
a3,b3,c3,d3,e3
a4,b4,c4,d4,e4
a5,b5,c5,d5,e5
a6,b6,c6,d6,e6
If we want 2×2 matrix, then resultant CSV file should be below:
col1, col2
a', d'
a", d"
Where
a'=(a1+a2+a3+b1+b2+b3+c1+c2+c3)/9
a"=(a4+a5+a6+b4+b5+b6+c4+c5+c6)/9
d'=(d1+d2+d3+e1+e2+e3)/6
d"=(d4+d5+d6+e4+e5+e6)/6
I want the python code which can reduce the dimension by putting the average of the sum of all sub matrix. For in example1, we have given 6×6 matrix, we want 2×2 matrix, so we consider (6÷2) × (6÷2) = 3×3 sub matrix and calculate average of 9 elements of 3×3 matrix , which is the one element of resultant 2×2 matrix and so on.
And in example2, if given dimension isn't multiple of resultant dimension, we use ceiling function, we first start by consider ceiling(5÷2)×(6÷2) = 3×3 matrix, and at the end(corner) , we mayn't get 3×3 matrix, we just calculate average of remaining elements, as we see in example2, example3.
Assuming this example:
col1 col2 col3 col4 col5 col6
0 0 1 2 3 4 5
1 6 7 8 9 10 11
2 12 13 14 15 16 17
3 18 19 20 21 22 23
4 24 25 26 27 28 29
5 30 31 32 33 34 35
You could rename the indexes (with set_axis
), stack
, and groupby.mean
:
import math
n, m = 2, 2 # desired shape
out = (df
.set_axis(np.arange(df.shape[0])//math.ceil(df.shape[0]/n), axis=0)
.set_axis(np.arange(df.shape[1])//math.ceil(df.shape[1]/m), axis=1)
.stack().groupby(level=[0, 1]).mean().unstack()
.rename(columns=lambda x: f'col{x+1}') # optional
)
Alternative using numpy and padding (numpy.pad
) with NaNs before reshape
and nanmean
:
import math
n, m = 2, 2 # desired shape
out = pd.DataFrame(np.nanmean(np.pad(df.astype(float), [(0, df.shape[0]%n),
(0, df.shape[1]%m)],
constant_values=np.nan)
.reshape(n, math.ceil(df.shape[0]/n), m, -1),
axis=(1, 3)
)
).rename(columns=lambda x: f'col{x+1}')
Output:
col1 col2
0 7.0 10.0
1 25.0 28.0
Output with a 5x6 input (last row missing):
col1 col2
0 7.0 10.0
1 22.0 25.0