audio · signal-processing · voice · pitch-tracking

Pitch detection via autocorrelation fails on higher pitches


I'm trying to get the pitch class from recorded voice (44.1 kHz) using autocorrelation. What I'm doing is basically described here: http://cnx.org/content/m11714/latest/ and also implemented here: http://code.google.com/p/yaalp/source/browse/trunk/csaudio/WaveAudio/WaveAudio/PitchDetection.cs (the part using PitchDetectAlgorithm.Amdf)

So, in order to detect the pitch class, I build up an array with the normalized correlation for each of the frequencies from C2 to B3 (2 octaves) and select the one with the highest value (applying a "1 - correlation" transformation first, so I'm searching for a maximum instead of a minimum).

I tested it with generated audio (a simple sine wave):

data[i] = (short)(Math.Sin(2 * Math.PI * i/fs * freq) * short.MaxValue);
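For context, here is a minimal, self-contained sketch of how such a test buffer can be generated; the sample rate, test frequency and buffer length are the values from the question, but the surrounding loop is my reconstruction:

```csharp
using System;

// Generate a pure sine test signal, as in the one-liner above.
const int fs = 44100;        // sample rate from the question
const double freq = 440.0;   // example test frequency (A4); the question sweeps many
const int bufLen = 4000;     // analysis buffer size from the question

short[] data = new short[bufLen];
for (int i = 0; i < bufLen; i++)
    data[i] = (short)(Math.Sin(2 * Math.PI * i / fs * freq) * short.MaxValue);

Console.WriteLine(data[0]); // sin(0) = 0, so the first sample is 0
```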

But it only works for input frequencies lower than B4. Investigating the generated array, I found that starting from G3 another peak emerges that eventually gets bigger than the correct one, so my B4 is detected as an E. Changing the number of analysed frequencies did not help at all.

My buffer size is 4000 samples and the frequency of B4 is ~494 Hz, so I cannot think of a reason why this is failing. Are there any further constraints on the frequencies or buffer sizes? What is going wrong here?
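As a sanity check on those numbers (pure arithmetic, not part of the algorithm — B4 ≈ 493.88 Hz in equal temperament with A4 = 440 Hz):

```csharp
using System;

// How many B4 periods fit into the analysis buffer?
const int fs = 44100;      // sample rate
const int bufLen = 4000;   // analysis buffer
const double b4 = 493.88;  // B4 in Hz (equal temperament, A4 = 440)

double samplesPerPeriod = fs / b4;                  // ≈ 89.3 samples per period
double periodsInBuffer = bufLen / samplesPerPeriod; // ≈ 44.8 periods in the buffer

Console.WriteLine($"{samplesPerPeriod:F1} samples/period, {periodsInBuffer:F1} periods in buffer");
```

So roughly 44 full periods of B4 fit into the buffer, which suggests the buffer size itself is not the limiting factor here.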

I'm aware that I could use an FFT like Performous does, but this method looked simple and also gives weighted frequencies that can be used for visualisations. I don't want to throw it away that easily, and I'd at least like to understand why it fails.

Update: Core function used:

private double _GetAmdf(int tone)
{
    int samplesPerPeriod = _SamplesPerPeriodPerTone[tone]; // samples in one period
    int accumDist = 0;                                     // accumulated distances
    int sampleIndex = 0;                                   // index of sample to analyze
    // Start value = index of the sample one period ahead
    for (int correlatingSampleIndex = sampleIndex + samplesPerPeriod; correlatingSampleIndex < _AnalysisBufLen; correlatingSampleIndex++, sampleIndex++)
    {
        // distance (correlation: 1 - dist / (IntMax * 2)) to the corresponding sample in the next period (0 = equal .. IntMax * 2 = totally different)
        int dist = Math.Abs(_AnalysisBuffer[sampleIndex] - _AnalysisBuffer[correlatingSampleIndex]);
        accumDist += dist;
    }

    return 1.0 - (double)accumDist / Int16.MaxValue / sampleIndex;
}

With that function, the pitch/tone is found as (pseudocode):

tone = argmax(_GetAmdf(tone)) for tone = C2..B3
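Putting the pieces together, a self-contained sketch of the whole scan might look like this. The tone indexing (0 = C2), the equal-tempered frequency formula and the argmax loop are my reconstruction of the setup described above, not the original code:

```csharp
using System;

const int fs = 44100;
const int bufLen = 4000;

// Equal-tempered frequency of tone index t, with t = 0 meaning C2 (~65.41 Hz).
double ToneFreq(int t) => 65.4064 * Math.Pow(2.0, t / 12.0);

// Normalized AMDF-style score, mirroring _GetAmdf above: higher = better match.
double Amdf(short[] buf, int samplesPerPeriod)
{
    long accumDist = 0;
    int count = 0;
    for (int i = 0; i + samplesPerPeriod < buf.Length; i++, count++)
        accumDist += Math.Abs(buf[i] - buf[i + samplesPerPeriod]);
    return 1.0 - (double)accumDist / short.MaxValue / count;
}

// Test signal: a pure A2 (110 Hz) sine, i.e. a frequency where detection works.
short[] data = new short[bufLen];
for (int i = 0; i < bufLen; i++)
    data[i] = (short)(Math.Sin(2 * Math.PI * 110.0 * i / fs) * short.MaxValue);

// Scan C2..B3 (24 semitones) and pick the maximum, as in the pseudocode.
int bestTone = 0;
double bestVal = double.MinValue;
for (int t = 0; t < 24; t++)
{
    int spp = (int)Math.Round(fs / ToneFreq(t));
    double v = Amdf(data, spp);
    if (v > bestVal) { bestVal = v; bestTone = t; }
}
Console.WriteLine($"best tone index: {bestTone}"); // A2 is index 9
```

Note that the A1 subharmonic of 110 Hz falls outside the C2..B3 range, which is exactly why this low-frequency case is unambiguous.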

I also tried using actual autocorrelation with:

double accumDist=0;
//...
double dist = _AnalysisBuffer[sampleIndex] * _AnalysisBuffer[correlatingSampleIndex];
//...
const double scaleValue = (double)Int16.MaxValue * (double)Int16.MaxValue;
return accumDist / (scaleValue * sampleIndex);

but that also fails, detecting an A3 as a D in addition to B4 as an E.

Note: I do not divide by the buffer length but by the number of samples actually compared. I'm not sure whether this is right, but it seems logical.


Solution

  • This is the common octave problem that arises when using autocorrelation and similar lag estimators of pitch (AMDF, ASDF, etc.).

    A frequency that is one octave (or any other integer factor) lower will give just as good a match in shifted-waveform similarity (e.g. a sine wave shifted by 2pi looks the same as one shifted by 4pi, and the two-period lag corresponds to a pitch one octave lower). Depending on noise and on how close the continuous peak falls to a sampled lag, one or the other estimation peak may come out slightly higher, with no change in the actual pitch.

    So some other test needs to be used to reject the lower-octave (or other submultiple-frequency) peaks in the waveform correlation or lag matching, e.g. checking whether a peak looks close enough to one or more other peaks one or more octaves (or other frequency multiples) up.
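One common heuristic along these lines (a sketch of the general idea, not code from the answer): instead of taking the absolute best-scoring lag, accept the shortest lag whose score comes within some tolerance of the global maximum, since submultiple peaks always sit at longer lags:

```csharp
using System;

// Octave-error guard: accept the shortest lag whose score is within
// `threshold` of the global maximum. `threshold` is a tunable
// assumption, not a derived constant.
int PickLag(double[] scoreByLag, int minLag, double threshold)
{
    double max = double.MinValue;
    for (int lag = minLag; lag < scoreByLag.Length; lag++)
        max = Math.Max(max, scoreByLag[lag]);

    // Lags are scanned from short to long, so the first acceptable lag
    // is the highest-frequency candidate, skipping submultiple peaks.
    for (int lag = minLag; lag < scoreByLag.Length; lag++)
        if (scoreByLag[lag] >= max - threshold)
            return lag;
    return -1; // unreachable for a non-empty score array
}

// Toy example: a true peak at lag 100 and a slightly stronger
// octave-down peak at lag 200, as described above.
double[] scores = new double[256];
scores[100] = 0.97;
scores[200] = 0.99;
Console.WriteLine(PickLag(scores, 50, 0.05)); // picks 100, not 200
```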