I have a long video from which I want to isolate a part and extract the log-spectrogram for that part. I am using moviepy to load the original file, which is in mp4 format. Then I use `subclip` to extract the relevant part, and `.audio` to refer only to the audio of the file. There is an option to extract the audio data and the sampling rate, just like when loading with librosa. The full code of the audio data and sampling rate extraction is the following:
```python
from moviepy.editor import VideoFileClip

with VideoFileClip(input_file) as video:
    # Use the first three seconds of the video
    clip = video.subclip(0, 3)
    # Get the audio data and sample rate
    y = clip.audio.to_soundarray()
    sr = clip.audio.fps
    l = clip.audio.duration
    print(f'y:{y.shape}, sr:{sr}, length:{l}')
```
This results in:

```
>>> y:(132300, 2), sr:44100, length:3
```
Next, I wish to convert the above data into a spectrogram. When I try the following, my machine crashes, or I get an error.
```python
import librosa
import librosa.display
from moviepy.editor import VideoFileClip

with VideoFileClip(input_file) as video:
    # Trim the video
    clip = video.subclip(start_time_sec, end_time_sec)
    # Get the length of the trimmed video
    length = end_time_sec - start_time_sec
    # Get the audio data and sample rate
    y = clip.audio.to_soundarray()
    sr = clip.audio.fps
    l = clip.audio.duration
    # Do something with the audio data
    spectrogram = librosa.feature.melspectrogram(y=y, n_fft=2048, hop_length=512)
    librosa.display.specshow(spectrogram, sr=sr)
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[33], line 16
     14 # Do something with the audio data
     15 spectrogram = librosa.feature.melspectrogram(y=y, n_fft=2048, hop_length=512)
---> 16 librosa.display.specshow(spectrogram, sr=sr)

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/librosa/display.py:1215, in specshow(data, x_coords, y_coords, x_axis, y_axis, sr, hop_length, n_fft, win_length, fmin, fmax, tuning, bins_per_octave, key, Sa, mela, thaat, auto_aspect, htk, unicode, intervals, unison, ax, **kwargs)
   1211 x_coords = __mesh_coords(x_axis, x_coords, data.shape[1], **all_params)
   1213 axes = __check_axes(ax)
-> 1215 out = axes.pcolormesh(x_coords, y_coords, data, **kwargs)
   1217 __set_current_image(ax, out)
   1219 # Set up axis scaling

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/matplotlib/__init__.py:1442, in _preprocess_data.<locals>.inner(ax, data, *args, **kwargs)
   1439 @functools.wraps(func)
   1440 def inner(ax, *args, data=None, **kwargs):
   1441     if data is None:
-> 1442         return func(ax, *map(sanitize_sequence, args), **kwargs)
   1444     bound = new_sig.bind(ax, *args, **kwargs)
   1445     auto_label = (bound.arguments.get(label_namer)
   1446                   or bound.kwargs.get(label_namer))

File /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/matplotlib/axes/_axes.py:6229, in Axes.pcolormesh(self, alpha, norm, cmap, vmin, vmax, shading, antialiased, *args, **kwargs)
   6225 C = C.ravel()
...
   1984     f"shading, A should have shape "
   1985     f"{' or '.join(map(str, ok_shapes))}, not {A.shape}")
   1986 return super().set_array(A)

ValueError: For X (129) and Y (132301) with flat shading, A should have shape (132300, 128, 3) or (132300, 128, 4) or (132300, 128) or (16934400,), not (132300, 128, 1)
```
Finally, when I use `power_to_db` and `plt.imshow` as in the following code:
```python
import librosa
import matplotlib.pyplot as plt

ps = librosa.feature.melspectrogram(y=y, sr=sr)
ps_db = librosa.power_to_db(ps)
# librosa.display.specshow(ps_db, x_axis='s', y_axis='log')
plt.imshow(ps_db, origin="lower", cmap=plt.get_cmap("magma"))
```
I get the following undesired result:
Is it something to do with the overlap size, or something else?
The librosa multi-channel format is channels-first, whereas your audio seems to be channels-last. Try `y = clip.audio.to_soundarray().T` to convert it.

There may also be problems with passing a stereo mel spectrogram to `librosa.display.specshow`.

If it is acceptable to work in mono, then convert the audio using `y = librosa.to_mono(y)` before doing the processing.