python, numpy, audio, audio-processing, pydub

Passing a numpy audio array around different audio libraries


I'm working on a project that involves numerous audio-processing tasks with text-to-speech, but I've hit a small snag. I'm going to be processing possibly hundreds of TTS audio segments, so I want to minimize file IO as much as possible. I need to synthesize speech with Coqui TTS, time-stretch it with AudioTSM, and then perform additional processing and splicing with PyDub.

I'm using Coqui TTS to generate speech like this:

from TTS.api import TTS
tts = TTS()
audio = tts.tts("Hello StackOverflow! Please help me!")

This returns a Python list of float32 values, which needs to be converted to a numpy array before it can be used with AudioTSM's ArrayReader.

Like this:

import numpy as np

audio_array = np.array(audio, dtype=np.float32)

# Reshape the array to (channels, samples), as ArrayReader expects
samples = len(audio_array)
channels = 1
sample_rate = 22050
audio_array = audio_array.reshape(channels, samples)

from audiotsm import wsola
from audiotsm.io.array import ArrayReader, ArrayWriter

reader = ArrayReader(audio_array)
tsm = wsola(reader.channels, speed=2)  # speed up playback by 2x
rate_adjusted = ArrayWriter(channels=channels)
tsm.run(reader, rate_adjusted)
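
At this point the time-stretched samples live in the writer's data attribute. A quick sanity check (just printing metadata) shows the shape and dtype that the later steps have to deal with:

# The writer holds the stretched samples as a floating-point array,
# shape (channels, new_samples) -- the same layout ArrayReader consumed
print(rate_adjusted.data.dtype, rate_adjusted.data.shape)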

So far, things are hunky-dory. If I use AudioTSM's WavWriter instead, the audio is time-stretched correctly, just as it would be if I had called tts_to_file and then WavReader.

The problem comes in with PyDub.

If I just directly pass in the result of the ArrayWriter's rate_adjusted.data.tobytes(), like this, I get MAJORLY distorted audio:

from pydub import AudioSegment

# Convert the processed audio data to a PyDub AudioSegment
processed_audio_segment = AudioSegment(
    rate_adjusted.data.tobytes(),
    frame_rate=sample_rate,
    sample_width=2,
    channels=channels
)
# Perform additional audio processing here...

processed_audio_segment.export('tts_output.wav', format='wav')

I can't find documentation that supports this, but after looking at the source for AudioSegment's __init__, I suspected it had something to do with Coqui outputting float32 while AudioSegment wants scaled int16 samples.
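
A quick way to see the mismatch is to print the buffer's metadata:

# Float samples are 4 (or 8) bytes each, but sample_width=2 above tells
# PyDub to slice the buffer into 2-byte ints -- hence the garbage audio
print(rate_adjusted.data.dtype, rate_adjusted.data.itemsize)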

Converting the array first (and passing the converted bytes into the AudioSegment constructor above) actually seems to produce a somewhat usable result:

# Scale the floats from [-1.0, 1.0] to the int16 range and convert
converted_audio = (rate_adjusted.data * 2**15).astype(np.int16).tobytes()

This produces an audio file that is not distorted but has a noticeable decrease in quality, and when exported it is actually about 25KB smaller than the one exported by AudioTSM's WavWriter without any processing. I would guess this is because int16 uses less data per sample. I tried converting to int32 instead, like this:

# int32 samples are 4 bytes each, so this also needs sample_width=4 above
converted_audio = (rate_adjusted.data * 2**31).astype(np.int32).tobytes()

But this doesn't really sound any better and takes up much more space. What am I missing here?

If I just export to a wav with WavWriter and read it back in with AudioSegment.from_wav(), there is no distortion, the export is identical, and I don't have to convert anything, but again, file IO is expensive and a pain.
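
For reference, that working file-based path looks like this (rebuilding the reader and TSM, since the earlier run consumed them):

from audiotsm.io.wav import WavWriter

# Run the time-stretch straight into a wav file on disk
reader = ArrayReader(audio_array)
tsm = wsola(channels, speed=2)
writer = WavWriter('stretched.wav', channels, sample_rate)
tsm.run(reader, writer)
writer.close()

# Reading it back with PyDub gives clean, undistorted audio
clean_segment = AudioSegment.from_wav('stretched.wav')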

Is there any way to properly convert between these array formats that won't cause distortion, loss of quality, or loss of sanity, besides just turning things into wavs? I could try other libraries, but my project has already been making heavy use of PyDub, even though it's proving to be a massive thorn in the tuchus. My goal is to perform all audio operations in memory with as much interoperability between libraries as possible.


Solution

  • I suspect the loss of quality is due to the difference in value scales between your original float32 audio array and the int16 format you're converting to. In a float wav file, values are typically scaled between -1.0 and 1.0. However, to avoid clipping, it's common for recordings, especially voice recordings, to use only a narrower portion of that range, such as between -0.1 and 0.1.

    It's worth noting that within the range -1.0 to 1.0, a float32 can represent roughly 2^31 distinct values, so utilizing only a portion of this range isn't a significant concern. On the other hand, when dealing with int16, if you're using only a tenth of the available range, you're effectively reducing the usable levels from 32768 down to about 3277. This reduction in resolution could be what's causing the noticeable decrease in audio quality.

    My suspicion is that this reduced range is the cause of the quality loss you're observing. To counter it, I would recommend rescaling your audio array to span the full -1 to 1 range before converting it to int16, which makes much better use of the int16 resolution. So something like:

    # Normalize so the loudest sample sits at +/-1.0
    data = data / np.max(np.abs(data))

    This division by the maximum absolute value in the data ensures that the entire range is fully utilized, mitigating the potential loss of quality during the conversion process.
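
    Putting it together with the variable names from your question, the full in-memory conversion might look like the sketch below (scaling by 32767 instead of 2**15 avoids overflowing int16 at exactly +/-1.0, and the peak guard avoids dividing by zero on silence):

    import numpy as np
    from pydub import AudioSegment

    data = rate_adjusted.data  # float samples in [-1, 1], shape (channels, samples)

    # Rescale so the loudest sample spans the full -1..1 range
    peak = np.max(np.abs(data))
    if peak > 0:
        data = data / peak

    # Quantize to int16; 32767 rather than 2**15 keeps +1.0 in range
    int16_data = (data * 32767).astype(np.int16)

    processed_audio_segment = AudioSegment(
        int16_data.T.tobytes(),  # transpose so multi-channel data interleaves
        frame_rate=sample_rate,
        sample_width=2,
        channels=channels,
    )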