Tags: python · python-3.x · sockets · voip · python-sounddevice

How do I make the voice delay disappear?


I was trying to do something similar to VoIP, where I record voice and send it to another program on the network using UDP (this is not a question about encryption). When I ran the code it worked, except that the audio came out choppy.

In other words, short words came through in full, but in longer phrases I could always identify the moment when the signal was interrupted and the receiver waited for another packet to arrive before it continued playing.

What I'm asking is: how do I make the voice sound smooth on the receiving side? I tried using threading to optimize the recording, but it didn't make much difference, and I don't know where else to go.

The Server Side:

import sounddevice as sd
import socket, pickle

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
h = socket.gethostbyname(socket.gethostname())
s.bind((h, 9001))
print("Server running on " + str(h) + ":9001")

while True:
    r = pickle.loads(s.recvfrom(102400)[0])  # one audio chunk per datagram
    sd.play(r, 4410)

The Client Side:

import sounddevice as sd
import socket, pickle, threading

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ip = input("IP >> ")
data = None

def Enviar():
    global data
    s.sendto(pickle.dumps(data), (ip, 9001))

while True:
    data = sd.rec(4410, samplerate=4410, channels=2)  # record 1 second at 4410 Hz
    sd.wait()
    threading.Thread(target=Enviar, args=()).start()

Solution

  • With computer audio, the receiving computer's sound card has a sample clock that determines how fast it converts audio sample values into electrical signals that drive the speaker. The sample clock runs at a fixed rate (e.g. 48000 samples per second, or whatever you've set it to) and in order for the audio to sound correct, a new audio sample must be fed into the sound card every 1/48000th of a second.

    In order to reduce the CPU load on the host computer, the sound card usually has a built-in audio buffer, so that instead of forcing the CPU to wake up every 1/48000th of a second to send exactly one sample, you can instead have the CPU wake up e.g. once every 100 ms and write in 4800 samples all at once. The sound card's internal electronics then manage feeding the individual samples out of that buffer instead.
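The arithmetic in that 100 ms example works out like this (a trivial sketch using the numbers from the text above):

```python
SAMPLE_RATE = 48000        # samples per second (the example rate above)
WAKEUP_INTERVAL = 0.100    # the CPU wakes up every 100 ms

# Samples the CPU must write into the card's buffer per wake-up:
samples_per_wakeup = int(SAMPLE_RATE * WAKEUP_INTERVAL)
print(samples_per_wakeup)        # 4800

# Without the buffer, the CPU would instead have to deliver one
# sample every 1/48000th of a second:
print(1 / SAMPLE_RATE * 1e6)     # ~20.8 microseconds
```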

    Therefore, the secret to continuous sound is never to let the sound card's buffer become empty. When the buffer is drained to empty (and therefore the sound card can't get the next sample to play at the instant it needs to play it) that is known as an audio underrun and it causes a glitch in the audio, as you heard.

    The easiest way to prevent the underruns is to buffer up more audio on the receiving computer, so that more time can pass without data being received before an underrun occurs. Of course, the downside of this is that there will be more latency between the time the sender sends the data and the time receiver plays it; that's probably okay for e.g. streaming recorded music, but not so good for a live voice conversation.
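A minimal sketch of that receiver-side buffering idea, using the chunk shape the question's code produces (stereo at 4410 Hz); the class and the constants here are invented for the sketch, and the pre-buffer depth is the knob that trades latency for underrun resistance:

```python
import queue

import numpy as np

CHANNELS = 2
BLOCKSIZE = 441        # frames per chunk (100 ms at the question's 4410 Hz)
PREBUFFER = 5          # chunks to accumulate before playback starts (~500 ms)

class JitterBuffer:
    """Queue received audio chunks and hand out silence on underrun."""

    def __init__(self):
        self.q = queue.Queue()
        self.started = False

    def push(self, chunk):
        # Called from the network thread for every received chunk.
        self.q.put(chunk)

    def pull(self):
        # Called from the audio output side once per block.
        if not self.started:
            # Hold playback until enough audio is queued to ride out jitter.
            if self.q.qsize() < PREBUFFER:
                return np.zeros((BLOCKSIZE, CHANNELS), dtype="float32")
            self.started = True
        try:
            return self.q.get_nowait()
        except queue.Empty:
            # Underrun: output silence instead of stalling the stream.
            return np.zeros((BLOCKSIZE, CHANNELS), dtype="float32")
```

On the receiving side you would call `push()` from the socket loop and wire `pull()` into a `sounddevice.OutputStream` callback (`outdata[:] = jb.pull()`), rather than calling `sd.play()` once per packet as the question's server does.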

    The harder approach is to ensure that all the data makes it across the network within a short, bounded amount of time. To do this with guaranteed reliability you need a special networking switch that allows devices to pre-reserve bandwidth, so that they can guarantee their audio packets won't get dropped. Without that guarantee you are left just hoping for the best: on a wired Ethernet connection you can often get away with it for a small number of audio channels, but over WiFi, as you've seen, the network is often very unreliable, so you will probably hear underrun glitches in many situations unless you dial up the buffering quite a lot.

    Some protocols use Forward Error Correction math to encode the audio in such a way that even if some subset of the UDP packets are lost, the original audio sample values can still be reconstructed from the remaining packets that were received. That increases the overall bandwidth usage somewhat, but it allows audio to avoid glitching as long as the number of dropped packets is relatively small. I'm not very familiar with how they work, however, so I can't say more about that.
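As a toy illustration of that idea only (real FEC codes such as Reed-Solomon are far more capable): with a single XOR parity packet sent per group, any one lost packet of the group can be rebuilt. The function names here are invented for the sketch:

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    """Build one parity packet by XORing every packet of the group.
    All packets must have the same length."""
    return reduce(xor_bytes, packets)

def recover_missing(received, parity):
    """Rebuild the single missing packet of a group: XOR the parity
    packet with every packet that did arrive."""
    return reduce(xor_bytes, received + [parity])

group = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]   # three audio packets
parity = make_parity(group)                        # sent alongside the group

# Suppose the middle packet is dropped in transit:
rebuilt = recover_missing([group[0], group[2]], parity)
print(rebuilt == group[1])    # True
```

The cost is the extra parity packet per group (here 33% more bandwidth), and this simple scheme only survives one loss per group; real codes spread the redundancy more cleverly.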

    The final approach (which I think is what you are asking about) is to have the receiving computer somehow try to "paper over" the missing audio by making up its own replacement sample-values for the missing audio. There are voice protocols that try to do this, with varying degrees of success (you've probably heard the results when talking over a bad cell-phone connection), but IMHO it's not really worth implementing, because there will still be an obvious glitch in the audio; just a different-sounding glitch. It might be worthwhile to fade the last samples of the received audio out to zero if you don't have more samples to follow them (to at least avoid an abrupt "pop") and then after new (post-underrun) audio is received, fade the first samples of the newly-received audio in as well (to avoid a second "pop"), but that only makes the glitch less annoying; it doesn't get rid of it.
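That fade-out/fade-in trick can be sketched with numpy; the `(frames, channels)` shape matches what sounddevice records, and the 5 ms ramp length is an arbitrary choice for the sketch:

```python
import numpy as np

SAMPLE_RATE = 4410
FADE = int(0.005 * SAMPLE_RATE)   # ~5 ms ramp: 22 samples at 4410 Hz

def fade_out(chunk):
    """Ramp the tail of the last chunk before an underrun down to zero,
    avoiding an abrupt 'pop' when the audio stops."""
    out = np.asarray(chunk, dtype="float32").copy()
    out[-FADE:] *= np.linspace(1.0, 0.0, FADE)[:, None]  # applies to all channels
    return out

def fade_in(chunk):
    """Ramp the head of the first post-underrun chunk up from zero."""
    out = np.asarray(chunk, dtype="float32").copy()
    out[:FADE] *= np.linspace(0.0, 1.0, FADE)[:, None]
    return out
```

The receiver would apply `fade_out` to the last chunk it played before running dry and `fade_in` to the first chunk that arrives afterwards; as noted above, this only softens the glitch, it doesn't remove it.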