python pytorch signal-processing librosa torchaudio

Slicing audio given video frames


I have audio from a video that I've loaded with PyTorch. Given a starting index and ending index corresponding to the video segment of interest, along with the video FPS and audio sampling rate, how would I go about extracting the slice of audio that matches the segment of interest of the video?

My intuition is to convert frames to time via:

start_time = frame_start / fps
end_time = frame_end / fps

then convert time to sample positions with:

start_sample = int(math.floor(start_time * sr))
end_sample = int(math.floor(end_time * sr))

Is this correct? Or is there something I'm missing? I'm worried that there will be loss of information since I'm converting the sample positions to ints with floor.


Solution

  • Let's say you have

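    import numpy as np

    fs = 44100                # audio sampling frequency (Hz)
    vfr = 24                  # video frame rate (frames per second)
    frame_start = 10          # index of first frame
    frame_end = 20            # index of last frame (must exceed frame_start for an exclusive cut)
    audio = np.arange(44100)  # one second of audio as an ndarray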
    

    You can then calculate the points in time at which to slice the audio:

    time_start = frame_start / vfr
    time_end = frame_end / vfr         # or (frame_end + 1) / vfr for inclusive cut
    

    and then to which samples those points in time correspond:

    sample_start_idx = int(time_start * fs)
    sample_end_idx = int(time_end * fs)
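
The steps above can be collected into one helper. This is only a sketch, not part of the original answer: the function name `frame_range_to_sample_range` and its `inclusive` flag are made up for illustration. It uses `fractions.Fraction` so that the division by the frame rate is exact and the only rounding left is the final floor, which shifts a boundary by less than one sample.

```python
from fractions import Fraction
import math


def frame_range_to_sample_range(frame_start, frame_end, fps, sr, inclusive=True):
    """Map a video frame range to an audio sample range [start, stop).

    Hypothetical helper: fps may be an int or a Fraction
    (e.g. Fraction(30000, 1001) for NTSC-style rates).
    """
    fps = Fraction(fps)
    # For an inclusive cut, the audio runs up to the start of the next frame.
    last = frame_end + 1 if inclusive else frame_end
    start = math.floor(frame_start * sr / fps)  # exact rational, then floor
    stop = math.floor(last * sr / fps)
    return start, stop


# Example with assumed values: frames 10..20 of a 24 fps video, 44.1 kHz audio.
audio = list(range(44100))  # stand-in for one second of mono samples
start, stop = frame_range_to_sample_range(10, 20, fps=24, sr=44100)
segment = audio[start:stop]
```

The same indices work on an ndarray or a PyTorch tensor. If the samples are on the last axis, as in a `(channels, samples)` layout, slice with `audio[..., start:stop]`.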
    
    

    It's up to you whether you want to be extra precise and account for the fact that the audio belonging to a given frame arguably starts half a frame before that frame's timestamp and ends half a frame after it. In that case use:

    time_start = np.clip((frame_start - 0.5) / vfr, 0, np.inf)
    time_end = (frame_end + 0.5) / vfr
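
A sketch of this half-frame variant as a function follows. The name `centered_sample_range` and the `num_samples` parameter are assumptions added for illustration. The code also clamps the end index to the length of the audio, which the snippet above does not do.

```python
def centered_sample_range(frame_start, frame_end, fps, sr, num_samples):
    """Sample range [start, stop) covering half a frame before frame_start
    through half a frame after frame_end, clamped to the audio length.

    Hypothetical helper; num_samples is the total number of audio samples.
    """
    time_start = max((frame_start - 0.5) / fps, 0.0)  # do not start before t = 0
    time_end = (frame_end + 0.5) / fps
    start = int(time_start * sr)
    stop = min(int(time_end * sr), num_samples)  # do not run past the end
    return start, stop


# Example with assumed values: a single frame (index 10) at 24 fps, 44.1 kHz audio.
start, stop = centered_sample_range(10, 10, fps=24, sr=44100, num_samples=44100)
```

For a single frame this yields a window about one frame long (`sr / fps` samples), centered on the frame's timestamp rather than starting at it.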