How to reduce the dimensions of a CSV file?


Suppose I have a CSV file of dimension m×n, that is, m rows and n columns. I want to reduce its dimensions by replacing each sub-matrix with the average of its values.

Example 1:

Given a 6×6 matrix (CSV file):

col1,col2,col3,col4,col5,col6
a1,b1,c1,d1,e1, f1
a2,b2,c2,d2,e2, f2
a3,b3,c3,d3,e3, f3
a4,b4,c4,d4,e4, f4
a5,b5,c5,d5,e5, f5
a6,b6,c6,d6,e6, f6

If we want a 2×2 matrix, the resulting CSV file should be:

col1, col2
a',  d'
a", d"

Where

a'=(a1+a2+a3+b1+b2+b3+c1+c2+c3)/9
a"=(a4+a5+a6+b4+b5+b6+c4+c5+c6)/9
d'=(d1+d2+d3+e1+e2+e3+f1+f2+f3)/9
d"=(d4+d5+d6+e4+e5+e6+f4+f5+f6)/9

Example 2:

Given a 5×6 matrix (CSV file):

col1,col2,col3,col4,col5,col6
a1,b1,c1,d1,e1, f1
a2,b2,c2,d2,e2, f2
a3,b3,c3,d3,e3, f3
a4,b4,c4,d4,e4, f4
a5,b5,c5,d5,e5, f5

If we want a 2×2 matrix, the resulting CSV file should be:

col1, col2
a',  d'
a", d"

Where

a'=(a1+a2+a3+b1+b2+b3+c1+c2+c3)/9
a"=(a4+a5+b4+b5+c4+c5)/6
d'=(d1+d2+d3+e1+e2+e3+f1+f2+f3)/9
d"=(d4+d5+e4+e5+f4+f5)/6

Example 3:

Given a 6×5 matrix (CSV file):

col1,col2,col3,col4,col5
a1,b1,c1,d1,e1
a2,b2,c2,d2,e2
a3,b3,c3,d3,e3
a4,b4,c4,d4,e4
a5,b5,c5,d5,e5
a6,b6,c6,d6,e6

If we want a 2×2 matrix, the resulting CSV file should be:

col1, col2
a',  d'
a", d"

Where

a'=(a1+a2+a3+b1+b2+b3+c1+c2+c3)/9
a"=(a4+a5+a6+b4+b5+b6+c4+c5+c6)/9
d'=(d1+d2+d3+e1+e2+e3)/6
d"=(d4+d5+d6+e4+e5+e6)/6

I want Python code that reduces the dimensions by replacing each sub-matrix with the average of its elements. In Example 1, given a 6×6 matrix and a desired 2×2 result, we consider (6÷2)×(6÷2) = 3×3 sub-matrices and compute the average of the 9 elements of each one; each average becomes one element of the resulting 2×2 matrix.

In Example 2, where the given dimensions aren't multiples of the desired dimensions, we use the ceiling function: we start with sub-matrices of size ceiling(5÷2)×ceiling(6÷2) = 3×3, and where a full 3×3 sub-matrix isn't available at the edges, we just average the remaining elements, as in Examples 2 and 3.
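
To make the ceiling rule concrete, here is a minimal sketch of how the block boundaries along one axis could be computed. The helper name `block_bounds` is only illustrative and is not part of any library.

```python
import math

def block_bounds(length, parts):
    """Return (start, stop) index pairs for the blocks along one axis."""
    size = math.ceil(length / parts)  # block size from the ceiling rule
    # the last block is clipped to the axis length, so it may be shorter
    return [(start, min(start + size, length))
            for start in range(0, length, size)]

print(block_bounds(6, 2))  # [(0, 3), (3, 6)]
print(block_bounds(5, 2))  # [(0, 3), (3, 5)]
```

With 5 rows and 2 desired rows, the second block only covers rows 3 and 4, which is why a" and d" in Example 2 are averages of 6 values instead of 9.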


Solution

  • Assuming this example:

       col1  col2  col3  col4  col5  col6
    0     0     1     2     3     4     5
    1     6     7     8     9    10    11
    2    12    13    14    15    16    17
    3    18    19    20    21    22    23
    4    24    25    26    27    28    29
    5    30    31    32    33    34    35
    

    You could rename the indexes (with set_axis), stack, and groupby.mean:

    import math
    
    import numpy as np
    import pandas as pd
    
    n, m = 2, 2 # desired shape
    
    out = (df
     .set_axis(np.arange(df.shape[0])//math.ceil(df.shape[0]/n), axis=0)
     .set_axis(np.arange(df.shape[1])//math.ceil(df.shape[1]/m), axis=1)
     .stack().groupby(level=[0, 1]).mean().unstack()
     .rename(columns=lambda x: f'col{x+1}') # optional
    )
    

    Alternative using numpy: pad with NaNs (numpy.pad) up to the next multiple of the target shape, then reshape and take nanmean:

    import math
    
    import numpy as np
    import pandas as pd
    
    n, m = 2, 2 # desired shape
    
    # -size % k is the number of rows/columns missing to reach a multiple of k
    out = pd.DataFrame(np.nanmean(np.pad(df.astype(float), [(0, -df.shape[0]%n), 
                                                            (0, -df.shape[1]%m)],
                                         constant_values=np.nan)
                                    .reshape(n, math.ceil(df.shape[0]/n), m, -1),
                                  axis=(1, 3)
                                 )
                      ).rename(columns=lambda x: f'col{x+1}')
    

    Output:

       col1  col2
    0   7.0  10.0
    1  25.0  28.0
    

    Output with a 5x6 input (last row missing):

       col1  col2
    0   7.0  10.0
    1  22.0  25.0
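
Since the question starts from a CSV file, here is a rough end-to-end sketch that reads the data, averages the blocks with plain numpy slicing, and writes the result back as CSV. It is a different route from the two approaches above, not a replacement for them. The function name `reduce_frame` is hypothetical, the input is assumed to be fully numeric, and an in-memory string stands in for a real file path.

```python
import io
import math

import numpy as np
import pandas as pd

def reduce_frame(df, n, m):
    """Average ceil-sized blocks of a numeric DataFrame (hypothetical helper)."""
    values = df.to_numpy(dtype=float)
    r = math.ceil(values.shape[0] / n)  # rows per block
    c = math.ceil(values.shape[1] / m)  # columns per block
    # slicing past the edge just yields a smaller block, as in Examples 2 and 3
    blocks = [[values[i:i + r, j:j + c].mean()
               for j in range(0, values.shape[1], c)]
              for i in range(0, values.shape[0], r)]
    names = [f'col{k + 1}' for k in range(len(blocks[0]))]
    return pd.DataFrame(blocks, columns=names)

# 5x6 sample input holding the values 0..29; replace with a real path if needed
csv_text = "col1,col2,col3,col4,col5,col6\n" + "\n".join(
    ",".join(str(6 * i + j) for j in range(6)) for i in range(5))

df = pd.read_csv(io.StringIO(csv_text))
out = reduce_frame(df, 2, 2)
print(out.to_csv(index=False))
```

For this 5x6 input the values should match the second output shown above. Passing a file name to `out.to_csv(...)` instead of printing would write the reduced matrix to disk.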