I have audio from a video that I've loaded with PyTorch. Given a starting index and ending index corresponding to the video segment of interest, along with the video FPS and audio sampling rate, how would I go about extracting the slice of audio that matches the segment of interest of the video?
My intuition is to convert frames to time via:
start_time = frame_start / fps
end_time = frame_end / fps
then convert time to sample position with:
start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))
Is this correct? Or is there something I'm missing? I'm worried that there will be loss of information since I'm converting the samples into ints with floor.
Let's say you have
import numpy as np
fs = 44100 # audio sampling frequency
vfr = 24 # video frame rate
frame_start = 10 # index of first frame
frame_end = 10 # index of last frame
audio = np.arange(44100) # audio in form of ndarray
You can then calculate the points in time at which you want to slice the audio:
time_start = frame_start / vfr
time_end = frame_end / vfr # exclusive cut (empty slice when frame_end == frame_start); use (frame_end + 1) / vfr for an inclusive cut
and then to which samples those points in time correspond:
sample_start_idx = int(time_start * fs)
sample_end_idx = int(time_end * fs)
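To make the final slicing step concrete, here is a minimal sketch of the same conversion wrapped in a function. The name `frames_to_sample_range` and the `inclusive` flag are made up for illustration, and a plain list stands in for the audio array so the snippet has no dependencies. It multiplies before dividing, so a result that is an exact integer (such as 10 * 44100 / 24) is not disturbed by a rounded intermediate time value.

```python
def frames_to_sample_range(frame_start, frame_end, vfr, fs, inclusive=True):
    """Map a video frame range to a half-open audio sample range [start, stop).

    Hypothetical helper: vfr is the video frame rate, fs the audio sampling rate.
    """
    # For an inclusive cut, the segment ends where the frame after frame_end begins.
    last = frame_end + 1 if inclusive else frame_end
    # Multiply first, then divide; int() truncates, which equals floor for values >= 0.
    start = int(frame_start * fs / vfr)
    stop = int(last * fs / vfr)
    return start, stop


audio = list(range(44100))  # stand-in for one second of mono audio
start, stop = frames_to_sample_range(10, 10, vfr=24, fs=44100)
segment = audio[start:stop]
print(start, stop, len(segment))  # prints: 18375 20212 1837
```

The same `audio[start:stop]` slice works on a 1-D NumPy array. One video frame at 24 fps spans 1837.5 samples at 44.1 kHz, so the truncation drops the trailing half sample here.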
It's up to you whether you want to be super-precise and account for the fact that the audio belonging to a given frame arguably starts half a frame before that frame's timestamp and ends half a frame after it. In that case, use:
time_start = np.clip((frame_start - 0.5) / vfr, 0, np.inf)
time_end = (frame_end + 0.5) / vfr
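As for the worry about `floor`: truncating moves each boundary by less than one sample, which is under 1/fs seconds (about 23 microseconds at 44.1 kHz), so no meaningful information is lost. The half-frame variant can be sketched as follows. The function name `centered_sample_range` and the `num_samples` parameter are assumptions for illustration, and clamping the end to the audio length is an addition that the snippets above do not include.

```python
def centered_sample_range(frame_start, frame_end, vfr, fs, num_samples):
    """Sample range covering frames frame_start..frame_end, widened by half a frame.

    Hypothetical helper: num_samples is the length of the audio along its time axis.
    """
    # Start half a frame early, but never before the beginning of the audio.
    time_start = max((frame_start - 0.5) / vfr, 0.0)
    # End half a frame after the last frame's timestamp.
    time_end = (frame_end + 0.5) / vfr
    start = int(time_start * fs)
    # Clamp so the range never runs past the end of the audio.
    stop = min(int(time_end * fs), num_samples)
    return start, stop


start, stop = centered_sample_range(10, 10, vfr=24, fs=44100, num_samples=44100)
print(start, stop)  # prints: 17456 19293

# Assuming a waveform laid out as (channels, samples), as is common for
# PyTorch audio tensors, slice along the last axis:
#     segment = waveform[..., start:stop]
```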