python numpy audio signal-processing soundfile

How can I extract the duration and offset from a numpy array representing audio?


I'm currently running a script where I take an entire audio file and save it using the audiofile library (which, in turn, uses the soundfile library) in Python.

I'm trying to mimic the behavior of audiofile.read() where I give it an offset and duration (in seconds) and only return the respective numpy array of that particular sound interval. The only difference here is that instead of taking in a .wav file like the library requires, I'll already have the entire audio file as a numpy array and need to extract the correct start and end intervals from it.

I've tried copying the library's logic for calculating the start and end and then slicing the numpy array as sound_file[start:end], but that doesn't seem to work. I'm not too familiar with how signal processing works with audio files, so I'm at a bit of a loss here and any help would be appreciated!

Here's my code:

I expect it to take in a numpy array and return that array sliced to the interval that starts at the offset and lasts for the specified duration. All the files I've loaded were originally 96 kHz, resampled to 16 kHz, and saved as numpy arrays.


import numpy as np
import audmath
from audiofile.core.utils import duration_in_seconds

def read_from_np(
    file,
    duration,
    offset,
    sampling_rate = 16000
):

    if duration is not None:
        duration = duration_in_seconds(duration, sampling_rate)
        if np.isnan(duration):
            duration = None
    if offset is not None and offset != 0:
        offset = duration_in_seconds(offset, sampling_rate)
        if np.isnan(offset):
            offset = None

    # Support for negative offset/duration values
    # by counting them from end of signal
    #
    if offset is not None and offset < 0 or duration is not None and duration < 0:
        # Import duration here to avoid circular imports
        from audiofile.core.info import duration as get_duration

        signal_duration = get_duration(file)
    # offset | duration
    # None   | < 0
    if offset is None and duration is not None and duration < 0:
        offset = max([0, signal_duration + duration])
        duration = None
    # None   | >= 0
    if offset is None and duration is not None and duration >= 0:
        if np.isinf(duration):
            duration = None
    # >= 0   | < 0
    elif offset is not None and offset >= 0 and duration is not None and duration < 0:
        if np.isinf(offset) and np.isinf(duration):
            offset = 0
            duration = None
        elif np.isinf(offset):
            duration = 0
        else:
            if np.isinf(duration):
                offset = min([offset, signal_duration])
                duration = np.sign(duration) * signal_duration
            orig_offset = offset
            offset = max([0, offset + duration])
            duration = min([-duration, orig_offset])
    # >= 0   | >= 0
    elif offset is not None and offset >= 0 and duration is not None and duration >= 0:
        if np.isinf(offset):
            duration = 0
        elif np.isinf(duration):
            duration = None
    # < 0    | None
    elif offset is not None and offset < 0 and duration is None:
        offset = max([0, signal_duration + offset])
    # >= 0    | None
    elif offset is not None and offset >= 0 and duration is None:
        if np.isinf(offset):
            duration = 0
    # < 0    | > 0
    elif offset is not None and offset < 0 and duration is not None and duration > 0:
        if np.isinf(offset) and np.isinf(duration):
            offset = 0
            duration = None
        elif np.isinf(offset):
            duration = 0
        elif np.isinf(duration):
            duration = None
        else:
            offset = signal_duration + offset
            if offset < 0:
                duration = max([0, duration + offset])
            else:
                duration = min([duration, signal_duration - offset])
            offset = max([0, offset])
    # < 0    | < 0
    elif offset is not None and offset < 0 and duration is not None and duration < 0:
        if np.isinf(offset):
            duration = 0
        elif np.isinf(duration):
            duration = -signal_duration
        else:
            orig_offset = offset
            offset = max([0, signal_duration + offset + duration])
            duration = min([-duration, signal_duration + orig_offset])
            duration = max([0, duration])

    # Convert to samples
    #
    # Handle duration first
    # and returned immediately
    # if duration == 0
    if duration is not None and duration != 0:
        duration = audmath.samples(duration, sampling_rate)
    if duration == 0:
        from audiofile.core.info import channels as get_channels

        channels = get_channels(file)
        if channels > 1 or always_2d:
            signal = np.zeros((channels, 0))
        else:
            signal = np.zeros((0,))
        return signal, sampling_rate
    if offset is not None and offset != 0:
        offset = audmath.samples(offset, sampling_rate)
    else:
        offset = 0


    start = offset
    # duration == 0 is handled further above with immediate return
    if duration is not None:
        stop = duration + start
    else:
        # No duration given: slice through to the end of the signal
        stop = None

    return np.expand_dims(file[0, start:stop], 0)


Solution

  • Your code boils down to

        return np.expand_dims(file[0, start:stop], 0)
    

    which is correct.

    So if you're unhappy with the result, it is due to computing the wrong (start, stop) pair, that is, the wrong (offset, duration) pair.

    The sample rate is apparently fixed at exactly 16_000 samples per second. The number of channels can be 1 or 2, which seems worrisome.

    There's a crazy amount of optional behavior associated with the offset and duration parameters. Get rid of it. Focus on writing a simple helper which accepts an offset that is always a non-negative integer, and a duration that is always a positive finite integer. No NaNs. Use assert or raise so that None or a negative value will blow up with a fatal error.

    Next, focus on audio segments that always have the same number of channels.

    At that point, it won't be hard to get it right.
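A minimal sketch of such a helper might look like the following. It assumes the array has shape (channels, samples), which is what the `file[0, start:stop]` indexing in the question suggests, and that offset and duration are counted in samples. The names `slice_samples` and `slice_seconds` are made up for illustration and are not part of audiofile or soundfile.

```python
import numpy as np


def slice_samples(signal, offset, duration):
    """Return signal[:, offset:offset + duration], counted in samples.

    Assumption: `signal` has shape (channels, samples).
    """
    if signal.ndim != 2:
        raise ValueError("expected a 2-D array of shape (channels, samples)")
    for name, value in (("offset", offset), ("duration", duration)):
        # Reject None, floats, NaN, and bools up front
        if isinstance(value, bool) or not isinstance(value, (int, np.integer)):
            raise TypeError(f"{name} must be an integer number of samples")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if duration <= 0:
        raise ValueError("duration must be positive")
    stop = offset + duration
    if stop > signal.shape[1]:
        raise ValueError("requested segment runs past the end of the signal")
    return signal[:, offset:stop]


def slice_seconds(signal, offset_s, duration_s, sampling_rate=16_000):
    """Hypothetical wrapper: convert seconds to sample counts, then slice."""
    offset = int(round(offset_s * sampling_rate))
    duration = int(round(duration_s * sampling_rate))
    return slice_samples(signal, offset, duration)


# 3 seconds of fake stereo audio at 16 kHz
signal = np.arange(2 * 48_000).reshape(2, 48_000)
segment = slice_seconds(signal, offset_s=0.5, duration_s=1.0)
print(segment.shape)  # (2, 16000)
```

Keeping the seconds-to-samples conversion in its own small wrapper means the slicing itself only ever sees validated integers, so a wrong result can be traced to either the conversion or the slice rather than to the interaction of many optional cases.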