
Decoded AVFrame's pts points at the end of the samples?


I'm using FFmpeg to decode a video, and the decoded audio frames have strange pts values.

Here is the output:

Audio Input 13 13 576 
Audio Input 36 23 1024
Audio Input 59 23 1024
Audio Input 82 23 1024 
Audio Input 105 23 1024
Audio Input 129 24 1024

The first number is AVFrame->pts, the second is AVFrame->pts minus the previous AVFrame->pts, and the third is AVFrame->nb_samples.

The decoder's time_base is 1/1000 and the sample rate is 44100.

It seems that the pts points at the end of the samples, not the beginning.

The first AVFrame contains 576 samples, and 576/44100*1000 = 13.06..., which matches the first AVFrame->pts.

The second AVFrame contains 1024 samples, and 1024/44100*1000 = 23.21..., which matches the second AVFrame->pts minus the first AVFrame->pts.
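The same arithmetic can be checked against all six frames at once. The sketch below assumes each pts is the cumulative sample count converted to milliseconds and truncated. Both the truncation and the helper name endOfFrameMs are assumptions made for illustration, not something taken from FFmpeg.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: time in milliseconds at which the first
// `totalSamples` samples of the stream end, truncated toward zero.
int64_t endOfFrameMs(int64_t totalSamples, int sampleRate) {
    assert(totalSamples >= 0 && sampleRate > 0);
    return totalSamples * 1000 / sampleRate;
}
```

At 44100 Hz, the cumulative counts 576, 1600, 2624, 3648, 4672 and 5696 give 13, 36, 59, 82, 105 and 129, the same values as the pts column, which is consistent with the end-of-samples reading.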

Is that normal?

This matters because I feed these AVFrames into an amix filter. The amix filter rescales the pts to a 1/44100 time base, giving 13 / 1000 * 44100 = 573. I only ask for 64 samples at a time, so amix returns the first AVFrame with a pts of 573 and nb_samples of 64, which is definitely wrong: the first output frame should start at 0. If I keep asking for more samples, amix returns them with increasing pts, like this:

Audio Frame 573 573 64
Audio Frame 637 64 64
Audio Frame 701 64 64
Audio Frame 765 64 64
Audio Frame 829 64 64
Audio Frame 893 64 64
Audio Frame 957 64 64
Audio Frame 1021 64 64
Audio Frame 1085 64 64
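The 573 in the first output line follows directly from the time base conversion. A minimal sketch of that conversion, assuming round-to-nearest as FFmpeg's av_rescale_q does by default; Rational and rescaleNearest are hypothetical stand-ins so the example compiles without FFmpeg, and only non-negative values are handled:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for AVRational.
struct Rational {
    int num;
    int den;
};

// Convert `value` from time base `from` to time base `to`,
// rounding to the nearest integer (non-negative values only).
int64_t rescaleNearest(int64_t value, Rational from, Rational to) {
    assert(value >= 0 && from.num > 0 && from.den > 0 && to.num > 0 && to.den > 0);
    const int64_t numerator = static_cast<int64_t>(from.num) * to.den;
    const int64_t denominator = static_cast<int64_t>(from.den) * to.num;
    return (value * numerator + denominator / 2) / denominator;
}
```

Calling rescaleNearest(13, {1, 1000}, {1, 44100}) yields 573, so an input frame stamped with its end time makes the output start 573 samples late instead of at 0.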

What is worse, when the decoder returns a frame with a smaller nb_samples, amix returns a pts lower than the previous one:

Audio Input 8488 23 1024
Audio Frame 374321 54 64
Audio Frame 374385 64 64
Audio Frame 374449 64 64
Audio Frame 374513 64 64
Audio Frame 374577 64 64
Audio Frame 374641 64 64
Audio Frame 374705 64 64
Audio Frame 374769 64 64
Audio Frame 374833 64 64
Audio Frame 374897 64 64
Audio Frame 374961 64 64
Audio Frame 375025 64 64
Audio Frame 375089 64 64
Audio Frame 375153 64 64
Audio Frame 375217 64 64
Audio Frame 375281 64 64
Audio Input 8501 13 576
Audio Frame 374894 -387 64

8488 / 1000 * 44100 = 374320.8

374321 + 64 * (1024/64-1) = 375281

But 8501 / 1000 * 44100 = 374894 < 375281, so the pts goes backwards.
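The backwards step can be reproduced with a few lines of arithmetic. This sketch models only the behaviour described in the log (a frame handed out in fixed 64-sample chunks, with the next input frame's pts taken as-is); it is not amix's actual implementation, and lastChunkPts and goesBackwards are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

// pts of the last chunk when a frame starting at firstPts (in samples) is
// handed out in fixed-size chunks. Assumes nbSamples is a multiple of chunk.
int64_t lastChunkPts(int64_t firstPts, int nbSamples, int chunk) {
    assert(chunk > 0 && nbSamples >= chunk && nbSamples % chunk == 0);
    return firstPts + static_cast<int64_t>(chunk) * (nbSamples / chunk - 1);
}

// True when the next input frame's pts does not come after the last chunk
// already emitted, i.e. the output timestamps would stall or go backwards.
bool goesBackwards(int64_t firstPts, int nbSamples, int chunk, int64_t nextPts) {
    return nextPts <= lastChunkPts(firstPts, nbSamples, chunk);
}
```

For the numbers in the log, lastChunkPts(374321, 1024, 64) is 375281 and goesBackwards(374321, 1024, 64, 374894) is true, whereas a next pts of 374321 + 1024 = 375345 would continue the sequence cleanly.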

I can work around this by shifting each decoded AVFrame's pts, but I'm not sure that is the right thing to do.

The FFmpeg version I'm using is n5.1.2-f31542651f.clean (vcpkg stores the source code in a directory with this name, so I assume this is the version).

The audio decoder is libvorbis, the container format is WebM, and the video decoder is libvpx v1.12.0.
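Such a shift could look like the sketch below, which shows two options under stated assumptions. The first subtracts the frame duration from the end-of-frame pts; the second ignores the container pts and counts samples, which avoids millisecond rounding but assumes the stream starts at zero and has no gaps. The names shiftToStart and SampleCounter are hypothetical, and neither option is confirmed here as the correct fix.

```cpp
#include <cassert>
#include <cstdint>

// Option 1: move an end-of-frame pts back by the frame's duration.
// Assumes a time base of 1/tbDen and rounds the duration to nearest.
int64_t shiftToStart(int64_t endPts, int nbSamples, int sampleRate, int tbDen) {
    assert(nbSamples >= 0 && sampleRate > 0 && tbDen > 0);
    const int64_t duration =
        (static_cast<int64_t>(nbSamples) * tbDen + sampleRate / 2) / sampleRate;
    return endPts - duration;
}

// Option 2: derive the pts from a running sample count.
// The returned pts is in a 1/sampleRate time base.
class SampleCounter {
public:
    // Returns the start pts of a frame with nbSamples samples and
    // advances the counter past it.
    int64_t stamp(int nbSamples) {
        assert(nbSamples >= 0);
        const int64_t pts = next_;
        next_ += nbSamples;
        return pts;
    }

private:
    int64_t next_ = 0;
};
```

With the first three frames from the log, shiftToStart turns the pts values 13, 36 and 59 into 0, 13 and 36, and SampleCounter produces 0, 576 and 1600 in the 1/44100 time base.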

Here is the video decoding code:

#include "Decoder.h"
#include <exception>
#include <stdexcept>
#include <iostream>


Decoder::Decoder(const char *filename)
    : formatContext(nullptr, closeAVFormatContext),
      videoCodecContext(nullptr, closeAVCodecContext),
      audioCodecContext(nullptr, closeAVCodecContext),
      avVideoFrame(newAVFramePtr()),
      avPacket(newAVPacketPtr()),
      avAudioFrame(newAVFramePtr()) {

    int ret;
    {
        auto pContext = avformat_alloc_context();
        if(!pContext) {
            throw std::bad_alloc();
        }
        ret = avformat_open_input(&pContext, filename, nullptr, nullptr);
        throwAVError(ret, "avformat_open_input Error");
        this->formatContext.reset(pContext);
    }
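    // Stream info is required below; fail early if it cannot be read.
    ret = avformat_find_stream_info(this->formatContext.get(), nullptr);
    throwAVError(ret, "avformat_find_stream_info Error");
    this->videoIndex = -1;
    this->audioIndex = -1;
    // nb_streams is unsigned, so use an unsigned loop counter.
    for (unsigned int i = 0; i < this->formatContext->nb_streams; ++i) {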
        AVStream *pStream = this->formatContext->streams[i];
        if(pStream->codecpar->codec_type == AVMEDIA_TYPE_VIDEO && this->videoIndex == -1) {
            this->_timeBase = pStream->time_base;
            this->videoIndex = i;
            this->videoParameters = pStream->codecpar;
            this->videoCodec = avcodec_find_decoder(this->videoParameters->codec_id);
            if(this->videoCodec == nullptr) {
                throw AVException(std::string("cannot find a suitable decoder ") + avcodec_get_name(this->videoParameters->codec_id));
            }
        }
        if(pStream->codecpar->codec_type == AVMEDIA_TYPE_AUDIO && this->audioIndex == -1) {
            this->audioIndex = i;
            this->audioParameters = pStream->codecpar;
            this->sampleRate = this->audioParameters->sample_rate;
            this->audioCodec = avcodec_find_decoder(this->audioParameters->codec_id);
            if(this->audioCodec == nullptr) {
                throw AVException(std::string("cannot find a suitable decoder ") + avcodec_get_name(pStream->codecpar->codec_id));
            }
        }
        if(this->audioIndex != -1 && this->videoIndex != -1) {
            break;
        }
    }

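    // Both streams are used below; bail out instead of passing unset parameters.
    if(this->videoIndex == -1 || this->audioIndex == -1) {
        throw AVException(std::string("input must contain both a video and an audio stream"));
    }
    this->videoCodecContext = createCodecContext(this->videoParameters, this->videoCodec);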
    this->audioCodecContext = createCodecContext(this->audioParameters, this->audioCodec);

}

AVCodecContextPtr Decoder::createCodecContext(AVCodecParameters *parameters, const AVCodec* codec) {
    AVCodecContextPtr context(nullptr, closeAVCodecContext);
    {
        auto ptr = avcodec_alloc_context3(codec);
        if(!ptr) {
            throw std::bad_alloc();
        }
        context.reset(ptr);
    }
    int ret = avcodec_parameters_to_context(context.get(), parameters);
    throwAVError(ret, "avcodec_parameters_to_context Error");
    ret = avcodec_open2(context.get(), codec, nullptr);
    throwAVError(ret, "avcodec_open2");
    return context;
}

Result Decoder::nextFrame() {
    int ret;
    do {
        ret = av_read_frame(this->formatContext.get(), this->avPacket.get());
        if(ret == AVERROR_EOF) {
            av_packet_unref(this->avPacket.get());
            return Result::eof;
        }
        throwAVError(ret, "av_read_frame Error");
    } while(this->avPacket->stream_index != this->videoIndex && this->avPacket->stream_index != this->audioIndex);
    bool isVideo = this->avPacket->stream_index == this->videoIndex;
    const auto& context = isVideo ? this->videoCodecContext : this->audioCodecContext;
    const auto& frame = isVideo ? this->avVideoFrame : this->avAudioFrame;
    ret = avcodec_send_packet(context.get(), this->avPacket.get());
    av_packet_unref(this->avPacket.get());
    throwAVError(ret, "avcodec_send_packet Error");
    ret = avcodec_receive_frame(context.get(), frame.get());
    if(ret == AVERROR(EAGAIN)) {
        return Result::again;
    }
    throwAVError(ret, "avcodec_receive_frame Error");
    return isVideo ? Result::video : Result::audio;
}



The video file was downloaded from here: https://www.webmfiles.org/demo-files/

Direct link: https://dl6.webmfiles.org/big-buck-bunny_trailer.webm


Solution

  • Hah, vorbis-in-webm. May I ask how the file was generated? I'm guessing the video was encoded by some VP9 encoder (it doesn't really matter which one) into a temporary file format (possibly IVF), the audio was encoded separately into Ogg/Vorbis, and some software then muxed the two together into WebM. I'm wondering what software did that final step.

    I guess that the audio encoding was separate and the intermediate container for the audio was Ogg, because Ogg/Vorbis stores timestamps ("granules") as end-of-packet instead of beginning-of-packet. I believe your muxing software isn't aware of this and stores the granule (end-of-packet) as the timestamp (start-of-packet) when converting to WebM. The obvious solution would be to use different software for the audio/video muxing stage, or to remove the separate steps entirely and encode everything with a single ffmpeg command line.

    (Other explanations or bugs are also possible, but please elaborate on how the file was generated.)