I extracted video frames and MFCCs from a video. The video frames have shape (524, 64, 64) and the MFCC array has shape (80, 525). The number of frames matches, but the MFCC dimensions are swapped. How can I rearrange the MFCC array so that its shape is (525, 80)?
And will permuting the dimensions distort the audio information?
Swapping the dimensions of a multidimensional array does not alter the values at all, only the index positions at which they are found, so the audio information is not distorted.
To make the time axis the first axis of your MFCC array, use NumPy's .T (transpose) attribute.
mfcc_timefirst = mfcc.T
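As a quick sanity check, the sketch below uses a random array as a hypothetical stand-in for the real MFCC matrix (the actual values will come from your feature extractor) and confirms that the transpose only swaps the indices. It also shows that reshape is not a substitute here, since it reinterprets the memory layout rather than swapping axes.

```python
import numpy as np

# Hypothetical stand-in for the real MFCC matrix: 80 coefficients x 525 frames.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((80, 525))

# Transpose: the time axis becomes the first axis, giving shape (525, 80).
mfcc_timefirst = mfcc.T

# Element [t, c] of the transposed array is element [c, t] of the original.
same_values = all(
    mfcc_timefirst[t, c] == mfcc[c, t]
    for t in (0, 100, 524)
    for c in (0, 40, 79)
)

# For a 2-D array, .T returns a view onto the same data rather than a copy.
shares_memory = np.shares_memory(mfcc, mfcc_timefirst)

# reshape yields the same shape but places values at the wrong positions.
reshaped = mfcc.reshape(525, 80)
reshape_matches = np.array_equal(reshaped, mfcc_timefirst)

print(mfcc_timefirst.shape, same_values, shares_memory, reshape_matches)
```

Because the result is a view, writing into mfcc_timefirst also changes mfcc; call mfcc.T.copy() if you need an independent array.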