c++ · synthesizer · jack · sound-synthesis · rtaudio

Basic software synthesizer grows in latency over time


I'm in the process of finishing a MIDI-controlled software synthesizer. The MIDI input and synthesis work fine, but I appear to have a problem when playing back the audio itself.

I'm using jackd as my audio server, with alsa as its backend, because it can be configured for low-latency applications such as, in my case, real-time MIDI instruments.

In my program I'm using RtAudio, a fairly well-known C++ library that connects to various sound servers and offers basic stream operations on them. As the name suggests, it's optimised for real-time audio.

I also use the Vc library, which provides vectorized math functions, to speed up the additive synthesis: I'm adding up a multitude of sine waves of different frequencies and amplitudes to produce a complex waveform on the output, such as a sawtooth or a square wave.
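For illustration, a single output sample of such a sawtooth-like wave can be computed like this (scalar and unvectorized, just to show the idea; the actual code below does the same summation with Vc):

#include <cmath>

// Scalar illustration only: sum of the first n_partials harmonics, the n-th
// partial having amplitude 1/(pi*n) with alternating sign, which is what the
// harmonics table in the full program below encodes.
float sawtooth_sample(float freq, float t, int n_partials = 64) {
    float s = 0;
    for (int n = 1; n <= n_partials; n++) {
        float amp = ((n % 2) ? 1.0f : -1.0f) / (3.14159265f * n);
        s += amp * std::sin(2 * 3.14159265f * n * freq * t);
    }
    return s;
}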

Now, the problem is not that the latency is high to begin with, as that could probably be solved or blamed on a lot of things, such as MIDI input or whatnot. The problem is that the latency between my soft synth and the final audio output starts out very low and, after a couple of minutes, becomes unbearably high.

Since I plan to use this to play "live", i.e. in my home, I can't really be bothered to play with an ever-growing latency between my keystrokes and the audio feedback I hear.

I've tried to reduce the code that reproduces the problem as much as possible, and I can't shrink it any further.

#include <queue>
#include <array>
#include <iostream>
#include <thread>
#include <iomanip>
#include <cmath> // for std::pow
#include <Vc/Vc>
#include <RtAudio.h>
#include <chrono>
#include <ratio>
#include <algorithm>
#include <numeric>


float midi_to_note_freq(int note) {
    //Calculate difference in semitones to A4 (note number 69) and use equal temperament to find pitch.
    return 440 * std::pow(2, ((double)note - 69) / 12);
}


const unsigned short nh = 64; //number of harmonics the synthesizer will sum up to produce final wave

struct Synthesizer {
    using clock_t = std::chrono::high_resolution_clock;


    static std::chrono::time_point<clock_t> start_time;
    static std::array<unsigned char, 128> key_velocities;

    static std::chrono::time_point<clock_t> test_time;
    static std::array<float, nh> harmonics;

    static void init();
    static float get_sample();
};


std::array<float, nh> Synthesizer::harmonics = {0};
std::chrono::time_point<std::chrono::high_resolution_clock> Synthesizer::start_time, Synthesizer::test_time;
std::array<unsigned char, 128> Synthesizer::key_velocities = {0};


void Synthesizer::init() { 
    start_time = clock_t::now();
}

float Synthesizer::get_sample() {

    float t = std::chrono::duration_cast<std::chrono::duration<float, std::ratio<1,1>>> (clock_t::now() - start_time).count();

    Vc::float_v result = Vc::float_v::Zero();

    for (int i = 0; i<key_velocities.size(); i++) {
        if (key_velocities.at(i) == 0) continue;
        auto v = key_velocities[i];
        float f = midi_to_note_freq(i);
        int j = 0;
        for (;j + Vc::float_v::size() <= nh; j+=Vc::float_v::size()) {
            Vc::float_v twopift = Vc::float_v::generate([f,t,j](int n){return 2*3.14159268*(j+n+1)*f*t;});
            Vc::float_v harms = Vc::float_v::generate([j](int n){return harmonics.at(n+j);}); // harmonics is a static member, so no capture is needed
            result += v*harms*Vc::sin(twopift); 
        }
    }
    return result.sum()/512;
}                                                                                                


std::queue<float> sample_buffer;
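// shared between the get_samples producer thread and the RtAudio callback; note there is no synchronization around this queue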

int streamCallback (void* output_buf, void* input_buf, unsigned int frame_count, double time_info, unsigned int stream_status, void* userData) {
    if(stream_status) std::cout << "Stream underflow" << std::endl;
    float* out = (float*) output_buf;
    for (int i = 0; i<frame_count; i++) {
        while(sample_buffer.empty()) {std::this_thread::sleep_for(std::chrono::nanoseconds(1000));}
        *out++ = sample_buffer.front(); 
        sample_buffer.pop();
    }
    return 0;
}


void get_samples(double ticks_per_second) {
    double tick_diff_ns = 1e9/ticks_per_second;
    double tolerance = 1.0/1000; // nanoseconds

    auto clock_start = std::chrono::high_resolution_clock::now();
    auto next_tick = clock_start + std::chrono::duration<double, std::nano> (tick_diff_ns);
    while(true) {
        while(std::chrono::duration_cast<std::chrono::duration<double, std::nano>>(std::chrono::high_resolution_clock::now() - next_tick).count() < tolerance) {std::this_thread::sleep_for(std::chrono::nanoseconds(100));}
        sample_buffer.push(Synthesizer::get_sample());
        next_tick += std::chrono::duration<double, std::nano> (tick_diff_ns);
    }
}


int Vc_CDECL main(int argc, char** argv) {
    Synthesizer::init();

    /* Fill the harmonic amplitude array with amplitudes corresponding to a sawtooth wave, just for testing */
    std::generate(Synthesizer::harmonics.begin(), Synthesizer::harmonics.end(), [n=0]() mutable {
            n++;
            if (n%2 == 0) return -1/3.14159268/n;
            return 1/3.14159268/n;
        });

    RtAudio dac;

    RtAudio::StreamParameters params;
    params.deviceId = dac.getDefaultOutputDevice();
    params.nChannels = 1;
    params.firstChannel = 0;
    unsigned int buffer_length = 32;

    std::thread sample_processing_thread(get_samples, std::atoi(argv[1]));
    std::this_thread::sleep_for(std::chrono::milliseconds(10));

    dac.openStream(&params, nullptr, RTAUDIO_FLOAT32, std::atoi(argv[1]) /*sample rate*/, &buffer_length /*frames per buffer*/, streamCallback, nullptr /*data ptr*/);

    dac.startStream();

    bool noteOn = false;
    while(true) {
        noteOn = !noteOn;
        std::cout << "noteOn = " << std::boolalpha << noteOn << std::endl;
        Synthesizer::key_velocities.at(65) = noteOn*127;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    sample_processing_thread.join();
    dac.stopStream();
}

To be compiled with g++ -march=native -pthread -o synth -Ofast main.cpp /usr/local/lib/libVc.a -lrtaudio

The program expects a sample rate as first argument. In my setup I use jackd -P 99 -d alsa -p 256 -n 3 & as my sound server (requires real-time priority permissions for the current user). Since the default sample rate for jackd is 48 kHz, I run the program with ./synth 48000.

alsa could be used directly as the sound server, though I prefer jackd when possible, for obscure reasons involving pulseaudio and alsa interactions.

If you manage to run the program at all, you should hear a (hopefully not too annoying) sawtooth wave switching on and off at regular intervals, with console output indicating when playback should start and stop. When noteOn is set to true, the synthesizer starts producing the sawtooth wave, and it stops when noteOn is set to false.

You'll hopefully see that at first, noteOn true and false correspond almost perfectly with the audio playing and stopping, but little by little, the audio source starts lagging behind until it starts to get very noticeable around 1 minute to 1 minute 30 seconds on my machine.

I'm 99% sure it has nothing to do with my program for the following reasons.

The "audio" takes this path through the program: Synthesizer::get_sample() produces a sample, the get_samples thread pushes it into sample_buffer, streamCallback pops it into the RtAudio output buffer, and from there it goes through jackd to the sound card.

The only thing that could be a source of ever-increasing latency here is the clock ticking, but it ticks at the same rate as the stream consumes samples, so that can't be it. If the clock ticked more slowly, RtAudio would complain about stream underruns and there would be noticeable audio corruption, which doesn't happen.

The clock could, however, tick faster, but I don't think that's the case: I've tested the clock by itself on numerous occasions, and while it does show a little bit of jitter, on the order of nanoseconds, that is to be expected. There is no cumulative drift in the clock itself.

Thus, the only possible source of growing latency would be internal functions of RtAudio or the sound server itself. I have googled around for a bit and found nothing of use.

I have been trying to solve this for a week or two now, and I've tested everything that could be going wrong on my side, and it works as expected, so I really don't know what could be happening.


What I have tried


Solution

  • I think your get_samples thread generates samples faster or slower than streamCallback consumes them. Using a clock for flow control is unreliable.

    The simple way to fix it: remove that thread and the sample_buffer queue, and generate the samples directly in the streamCallback function (a minimal sketch of this is included after the steps below).

    If you do want to use multithreading in your app, it requires proper synchronization between producer and consumer, which is much more complex. In short, the steps are below.

    1. Replace your queue with a reasonably small fixed-length circular buffer. Technically, std::queue would work too, just slower because it's pointer-based, and you'd need to limit the maximum size manually.

    2. In the producer thread, implement an endless loop that checks whether there is free space in the buffer: if there is, generate more audio; if not, wait for the consumer to consume data from the buffer.

    3. In the consumer's streamCallback, copy data from the circular buffer to output_buf. If there isn't enough data available, wake the producer thread and wait for it to produce the data.

    Unfortunately, an efficient implementation of this is quite tricky. You need synchronization to protect the shared data, but you don't want too much synchronization, otherwise the producer and consumer will be serialized and will only use a single hardware thread. One approach is a single std::mutex to protect the buffer while moving pointers/size/offset (but unlocked while reading/writing the data), and two std::condition_variables: one for the producer to sleep on when there's no free space in the buffer, another for the consumer to sleep on when there's no data in the buffer; a rough sketch follows at the end of this answer.
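    A minimal sketch of the simple fix (names like render_sample and the hard-coded sample rate are illustrative, not from your code): drive the synth from a frame counter instead of the wall clock, so the generated audio can never drift against the stream. In your program render_sample would correspond to Synthesizer::get_sample, minus the std::chrono call, and the sample rate would normally be passed in through userData.

        #include <cmath>
        #include <iostream>
        #include <RtAudio.h>

        // Placeholder for the synth's per-sample function; here just a quiet 440 Hz sine.
        float render_sample(double t) { return 0.1f * std::sin(2 * 3.14159265 * 440 * t); }

        int streamCallback(void* output_buf, void* /*input_buf*/, unsigned int frame_count,
                           double /*stream_time*/, RtAudioStreamStatus status, void* /*userData*/) {
            static unsigned long long frames_done = 0;  // total frames rendered so far
            const double sample_rate = 48000;           // must match the rate passed to openStream
            if (status) std::cout << "Stream underflow" << std::endl;

            float* out = static_cast<float*>(output_buf);
            for (unsigned int i = 0; i < frame_count; i++) {
                double t = (frames_done + i) / sample_rate;  // time of this frame in seconds
                out[i] = render_sample(t);
            }
            frames_done += frame_count;
            return 0;
        }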
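    And a rough sketch of that ring buffer. For brevity the lock is held while copying samples, which is simpler but less parallel than the unlock-while-copying variant described above, and pop() blocks when the buffer runs dry, whereas a production callback should write silence instead of blocking:

        #include <condition_variable>
        #include <cstddef>
        #include <mutex>
        #include <vector>

        struct SampleRing {
            explicit SampleRing(std::size_t capacity) : buf(capacity) {}

            // Producer side: block until there is free space, then store one sample.
            void push(float s) {
                std::unique_lock<std::mutex> lk(m);
                not_full.wait(lk, [this] { return count < buf.size(); });
                buf[(head + count) % buf.size()] = s;
                ++count;
                not_empty.notify_one();
            }

            // Consumer side (called from the audio callback): block until a sample is available.
            float pop() {
                std::unique_lock<std::mutex> lk(m);
                not_empty.wait(lk, [this] { return count > 0; });
                float s = buf[head];
                head = (head + 1) % buf.size();
                --count;
                not_full.notify_one();
                return s;
            }

        private:
            std::vector<float> buf;
            std::size_t head = 0, count = 0;
            std::mutex m;
            std::condition_variable not_full, not_empty;
        };

    The producer thread then just loops on ring.push(Synthesizer::get_sample()), and streamCallback fills each frame of output_buf with ring.pop().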