node.js, audio, text-to-speech, pcm, typed-arrays

How to use NVIDIA Jarvis TTS in Node.js


I am trying to convert the Python Jarvis TTS example to Node.js. I am able to get the audio back from Jarvis, but when I play it there is a lot of noise.

The Python example uses a 16-bit depth, but with the same settings I get very stretched audio on Node.

From what I understand from their protofiles, the audio is uncompressed 16-bit signed little-endian samples (Linear PCM).

    datalen = len(resp.audio) // 4          # 4 bytes per float32 sample
    data32 = np.ndarray(buffer=resp.audio, dtype=np.float32, shape=(datalen, 1))
    data16 = np.int16(data32 * 23173.26)    # 23173.26 ≈ 32767.0 / 1.414
    speech = bytes(data16.data)
    print(speech)

I have tried typed arrays, converting to a Float32Array and then to an Int16Array, but no luck.

It sounds good from the Python implementation, but from Node it has too much noise.

The parameters passed in the TTS gRPC request are the same in both.
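For reference, the float32-to-int16 conversion the Python snippet performs can be sketched as a small standalone Node function (the function name is mine, not from the Jarvis samples; it assumes `audio` is a Node `Buffer` of little-endian float32 samples):

```javascript
// Convert a Buffer of little-endian float32 samples into 16-bit signed PCM,
// clamping to [-1, 1] and scaling to the int16 range as in the Python example.
function floatToPcm16(audio) {
    const n = audio.length / 4; // 4 bytes per float32 sample
    const pcm = new Int16Array(n);
    for (let i = 0; i < n; i++) {
        // readFloatLE honours the Buffer's byte offset and little-endian order
        const s = Math.max(-1, Math.min(1, audio.readFloatLE(i * 4)));
        pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    return Buffer.from(pcm.buffer, pcm.byteOffset, pcm.byteLength);
}
```

Using `readFloatLE` here (rather than a `DataView`) sidesteps both the endianness default and the fact that a `Buffer`'s underlying `ArrayBuffer` can be a shared pool.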

Edit:

I was able to get the audio to play with minimal noise at 16-bit depth with this code, but there is still some disturbance.

    this.ttsClient.Synthesize(SynthesizeSpeechRequest, (err, resp) => {
        if (err) console.log(err);
        const b16 = new Float32Array(resp.audio.length / 4);
        const v = new DataView(resp.audio.buffer);

        for (let i = 0; i < resp.audio.byteLength; i += 4) {
            b16[i / 4] = v.getFloat32(i);
        }

        let l = b16.length;
        const buf = new Int16Array(l);

        while (l--) {
            // clamp to [-1, 1], then scale to the signed 16-bit range
            const s = Math.max(-1, Math.min(1, b16[l]));
            buf[l] = s < 0 ? s * 0x8000 : s * 0x7fff;
        }

        console.log("buf", buf);
        cb(Buffer.from(buf.buffer));
    });
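A likely source of the remaining disturbance (my reading, not stated in the post): `DataView.prototype.getFloat32` defaults to big-endian, and a Node `Buffer`'s `.buffer` property can point at a shared allocation pool, so a view constructed without `byteOffset` and `byteLength` may not even start at the audio data. The endianness half can be demonstrated in isolation:

```javascript
// The four bytes below encode 1.0 as a little-endian float32.
const bytes = Buffer.from([0x00, 0x00, 0x80, 0x3f]);
// Pass byteOffset/byteLength so the view covers exactly this Buffer's bytes.
const v = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
console.log(v.getFloat32(0));        // big-endian read: a denormal near zero, not 1.0
console.log(v.getFloat32(0, true));  // little-endian read: 1.0
```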

Solution

  • I was able to solve this by using readFloatLE instead of a DataView.

        this.ttsClient.Synthesize(SynthesizeSpeechRequest, (err, resp) => {
            if (err) console.log(err);
            const b16 = new Float32Array(resp.audio.length / 4); // 4 bytes per float32 sample

            for (let i = 0; i < resp.audio.byteLength; i += 4) {
                // readFloatLE honours the Buffer's offset and little-endian order
                b16[i / 4] = resp.audio.readFloatLE(i);
            }

            let l = b16.length;
            const buf = new Int16Array(l); // signed 16-bit PCM

            while (l--) {
                const s = Math.max(-1, Math.min(1, b16[l]));
                buf[l] = s < 0 ? s * 0x8000 : s * 0x7fff;
            }
            cb(Buffer.from(buf.buffer));
        });
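To actually listen to the result, the raw PCM can be wrapped in a minimal WAV header so any audio player will accept it. This is a hedged sketch, not part of the original solution; the sample rate default (22050 Hz) is an assumption and should match the `sample_rate_hz` you pass in `SynthesizeSpeechRequest`:

```javascript
// Wrap raw 16-bit signed little-endian PCM in a 44-byte canonical WAV header.
// ASSUMPTION: 22050 Hz mono by default; adjust to match your TTS request.
function pcmToWav(pcm, sampleRate = 22050, channels = 1) {
    const header = Buffer.alloc(44);
    header.write('RIFF', 0);
    header.writeUInt32LE(36 + pcm.length, 4);            // RIFF chunk size
    header.write('WAVE', 8);
    header.write('fmt ', 12);
    header.writeUInt32LE(16, 16);                        // fmt chunk size
    header.writeUInt16LE(1, 20);                         // audio format: PCM
    header.writeUInt16LE(channels, 22);
    header.writeUInt32LE(sampleRate, 24);
    header.writeUInt32LE(sampleRate * channels * 2, 28); // byte rate
    header.writeUInt16LE(channels * 2, 32);              // block align
    header.writeUInt16LE(16, 34);                        // bits per sample
    header.write('data', 36);
    header.writeUInt32LE(pcm.length, 40);                // data chunk size
    return Buffer.concat([header, pcm]);
}
```

Usage: `require('fs').writeFileSync('out.wav', pcmToWav(pcmBuffer));` then play `out.wav` with any player.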