I'm using AVAssetWriter to write audio CMSampleBuffers to an mp4 file, but when I later read that file using AVAssetReader, it seems to be missing the initial chunk of data.
Here's the debug description of the first CMSampleBuffer passed to the writer input's append method (notice the trim duration attachment of 1024/44100):
CMSampleBuffer 0x102ea5b60 retainCount: 7 allocator: 0x1c061f840
invalid = NO
dataReady = YES
makeDataReadyCallback = 0x0
makeDataReadyRefcon = 0x0
buffer-level attachments:
TrimDurationAtStart = {
epoch = 0;
flags = 1;
timescale = 44100;
value = 1024;
}
formatDescription = <CMAudioFormatDescription 0x281fd9720 [0x1c061f840]> {
mediaType:'soun'
mediaSubType:'aac '
mediaSpecific: {
ASBD: {
mSampleRate: 44100.000000
mFormatID: 'aac '
mFormatFlags: 0x2
mBytesPerPacket: 0
mFramesPerPacket: 1024
mBytesPerFrame: 0
mChannelsPerFrame: 2
mBitsPerChannel: 0 }
cookie: {<CFData 0x2805f50a0 [0x1c061f840]>{length = 39, capacity = 39, bytes = 0x03808080220000000480808014401400 ... 1210068080800102}}
ACL: {(null)}
FormatList Array: {
Index: 0
ChannelLayoutTag: 0x650002
ASBD: {
mSampleRate: 44100.000000
mFormatID: 'aac '
mFormatFlags: 0x0
mBytesPerPacket: 0
mFramesPerPacket: 1024
mBytesPerFrame: 0
mChannelsPerFrame: 2
mBitsPerChannel: 0 }}
}
extensions: {(null)}
}
sbufToTrackReadiness = 0x0
numSamples = 1
outputPTS = {6683542167/44100 = 151554.244, rounded}(based on cachedOutputPresentationTimeStamp)
sampleTimingArray[1] = {
{PTS = {6683541143/44100 = 151554.221, rounded}, DTS = {6683541143/44100 = 151554.221, rounded}, duration = {1024/44100 = 0.023}},
}
sampleSizeArray[1] = {
sampleSize = 163,
}
dataBuffer = 0x281cc7a80
Here's the debug description of the second CMSampleBuffer (notice the trim duration attachment of 1088/44100, which combined with the previous buffer's 1024 yields the standard priming value of 2112):
CMSampleBuffer 0x102e584f0 retainCount: 7 allocator: 0x1c061f840
invalid = NO
dataReady = YES
makeDataReadyCallback = 0x0
makeDataReadyRefcon = 0x0
buffer-level attachments:
TrimDurationAtStart = {
epoch = 0;
flags = 1;
timescale = 44100;
value = 1088;
}
formatDescription = <CMAudioFormatDescription 0x281fd9720 [0x1c061f840]> {
mediaType:'soun'
mediaSubType:'aac '
mediaSpecific: {
ASBD: {
mSampleRate: 44100.000000
mFormatID: 'aac '
mFormatFlags: 0x2
mBytesPerPacket: 0
mFramesPerPacket: 1024
mBytesPerFrame: 0
mChannelsPerFrame: 2
mBitsPerChannel: 0 }
cookie: {<CFData 0x2805f50a0 [0x1c061f840]>{length = 39, capacity = 39, bytes = 0x03808080220000000480808014401400 ... 1210068080800102}}
ACL: {(null)}
FormatList Array: {
Index: 0
ChannelLayoutTag: 0x650002
ASBD: {
mSampleRate: 44100.000000
mFormatID: 'aac '
mFormatFlags: 0x0
mBytesPerPacket: 0
mFramesPerPacket: 1024
mBytesPerFrame: 0
mChannelsPerFrame: 2
mBitsPerChannel: 0 }}
}
extensions: {(null)}
}
sbufToTrackReadiness = 0x0
numSamples = 1
outputPTS = {6683543255/44100 = 151554.269, rounded}(based on cachedOutputPresentationTimeStamp)
sampleTimingArray[1] = {
{PTS = {6683542167/44100 = 151554.244, rounded}, DTS = {6683542167/44100 = 151554.244, rounded}, duration = {1024/44100 = 0.023}},
}
sampleSizeArray[1] = {
sampleSize = 179,
}
dataBuffer = 0x281cc4750
Now, when I read the audio track using AVAssetReader, the first CMSampleBuffer I get is:
CMSampleBuffer 0x102ed7b20 retainCount: 7 allocator: 0x1c061f840
invalid = NO
dataReady = YES
makeDataReadyCallback = 0x0
makeDataReadyRefcon = 0x0
buffer-level attachments:
EmptyMedia(P) = true
formatDescription = (null)
sbufToTrackReadiness = 0x0
numSamples = 0
outputPTS = {0/1 = 0.000}(based on outputPresentationTimeStamp)
sampleTimingArray[1] = {
{PTS = {0/1 = 0.000}, DTS = {INVALID}, duration = {0/1 = 0.000}},
}
dataBuffer = 0x0
and the next one contains priming info of 1088/44100:
CMSampleBuffer 0x10318bc00 retainCount: 7 allocator: 0x1c061f840
invalid = NO
dataReady = YES
makeDataReadyCallback = 0x0
makeDataReadyRefcon = 0x0
buffer-level attachments:
FillDiscontinuitiesWithSilence(P) = true
GradualDecoderRefresh(P) = 1
TrimDurationAtStart(P) = {
epoch = 0;
flags = 1;
timescale = 44100;
value = 1088;
}
IsGradualDecoderRefreshAuthoritative(P) = false
formatDescription = <CMAudioFormatDescription 0x281fdcaa0 [0x1c061f840]> {
mediaType:'soun'
mediaSubType:'aac '
mediaSpecific: {
ASBD: {
mSampleRate: 44100.000000
mFormatID: 'aac '
mFormatFlags: 0x0
mBytesPerPacket: 0
mFramesPerPacket: 1024
mBytesPerFrame: 0
mChannelsPerFrame: 2
mBitsPerChannel: 0 }
cookie: {<CFData 0x2805f3800 [0x1c061f840]>{length = 39, capacity = 39, bytes = 0x03808080220000000480808014401400 ... 1210068080800102}}
ACL: {Stereo (L R)}
FormatList Array: {
Index: 0
ChannelLayoutTag: 0x650002
ASBD: {
mSampleRate: 44100.000000
mFormatID: 'aac '
mFormatFlags: 0x0
mBytesPerPacket: 0
mFramesPerPacket: 1024
mBytesPerFrame: 0
mChannelsPerFrame: 2
mBitsPerChannel: 0 }}
}
extensions: {{
VerbatimISOSampleEntry = {length = 87, bytes = 0x00000057 6d703461 00000000 00000001 ... 12100680 80800102 };
}}
}
sbufToTrackReadiness = 0x0
numSamples = 43
outputPTS = {83/600 = 0.138}(based on outputPresentationTimeStamp)
sampleTimingArray[1] = {
{PTS = {1024/44100 = 0.023}, DTS = {1024/44100 = 0.023}, duration = {1024/44100 = 0.023}},
}
sampleSizeArray[43] = {
sampleSize = 179,
sampleSize = 173,
sampleSize = 178,
sampleSize = 172,
sampleSize = 172,
sampleSize = 159,
sampleSize = 180,
sampleSize = 200,
sampleSize = 187,
sampleSize = 189,
sampleSize = 206,
sampleSize = 192,
sampleSize = 195,
sampleSize = 186,
sampleSize = 183,
sampleSize = 189,
sampleSize = 211,
sampleSize = 198,
sampleSize = 204,
sampleSize = 211,
sampleSize = 204,
sampleSize = 202,
sampleSize = 218,
sampleSize = 210,
sampleSize = 206,
sampleSize = 207,
sampleSize = 221,
sampleSize = 219,
sampleSize = 236,
sampleSize = 219,
sampleSize = 227,
sampleSize = 225,
sampleSize = 225,
sampleSize = 229,
sampleSize = 225,
sampleSize = 236,
sampleSize = 233,
sampleSize = 231,
sampleSize = 249,
sampleSize = 234,
sampleSize = 250,
sampleSize = 249,
sampleSize = 259,
}
dataBuffer = 0x281cde370
The input's append method keeps returning true, which in principle means that all sample buffers got appended, but the reader for some reason skips the first chunk of data. Is there anything I'm doing wrong here?
I'm using the following code to read the file:
let asset = AVAsset(url: fileURL)
guard let assetReader = try? AVAssetReader(asset: asset) else {
    return
}
asset.loadValuesAsynchronously(forKeys: ["tracks"]) {
    guard let audioTrack = asset.tracks(withMediaType: .audio).first else { return }
    let audioOutput = AVAssetReaderTrackOutput(track: audioTrack, outputSettings: nil)
    assetReader.add(audioOutput) // attach the output before reading starts
    assetReader.startReading()
    while assetReader.status == .reading {
        if let sampleBuffer = audioOutput.copyNextSampleBuffer() {
            // do something
        }
    }
}
First some pedantry: you haven't lost your first sample buffer, but rather the first packet within your first sample buffer.
The behaviour of AVAssetReader with nil outputSettings when reading AAC packet data has changed on iOS 13 and macOS 10.15 (Catalina).
Previously you would get the first AAC packet, that packet's presentation timestamp (zero) and a trim attachment instructing you to discard the usual first 2112 frames of decoded audio.
Now [iOS 13, macOS 10.15] AVAssetReader seems to discard the first packet, leaving you with the second packet, whose presentation timestamp is 1024, so you need only discard 2112 - 1024 = 1088 of the decoded frames.
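You can verify this yourself by pulling the trim attachment off each buffer the reader vends. A minimal sketch (assuming sampleBuffer comes from copyNextSampleBuffer() in a loop like yours):

import CoreMedia

func primingTrim(of sampleBuffer: CMSampleBuffer) -> CMTime? {
    // TrimDurationAtStart is stored as a CFDictionary representation of a CMTime.
    guard let attachment = CMGetAttachment(sampleBuffer,
                                           key: kCMSampleBufferAttachmentKey_TrimDurationAtStart,
                                           attachmentModeOut: nil) else { return nil }
    return CMTimeMakeFromDictionary((attachment as! CFDictionary))
}

On iOS 13 / macOS 10.15 the first non-empty buffer reports 1088/44100 rather than the conventional 2112/44100.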
Something that might not be immediately obvious in the above situations is that AVAssetReader is talking about TWO timelines, not one. The packet timestamps refer to one of them, the untrimmed timeline, and the trim instruction implies the existence of the other: the trimmed timeline.
The transformation from untrimmed to trimmed timestamps is very simple: it's usually trimmed = untrimmed - 2112.
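As a worked example with the numbers above (just a sketch of the arithmetic in CMTime terms):

import CoreMedia

let priming = CMTime(value: 2112, timescale: 44100)       // standard AAC priming
let untrimmedPTS = CMTime(value: 1024, timescale: 44100)  // second packet's PTS
let trimmedPTS = CMTimeSubtract(untrimmedPTS, priming)
print(trimmedPTS.value)  // -1088: exactly the frames the trim attachment discards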
So is the new behaviour a bug? The fact that if you decode to LPCM and correctly follow the trim instructions you should still get the same audio leads me to believe the change was intentional (NB: I haven't yet personally confirmed that the LPCM samples are the same).
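If you want to check that claim, one way is to read the track decoded to LPCM and count frames; with LPCM output AVAssetReader applies the trim for you. A sketch, assuming fileURL points at the written file:

import AVFoundation

let asset = AVAsset(url: fileURL)
let reader = try! AVAssetReader(asset: asset)  // force-try for brevity
let track = asset.tracks(withMediaType: .audio).first!
let lpcmOutput = AVAssetReaderTrackOutput(track: track,
                                          outputSettings: [AVFormatIDKey: kAudioFormatLinearPCM])
reader.add(lpcmOutput)
reader.startReading()
var decodedFrames: Int64 = 0
while let buffer = lpcmOutput.copyNextSampleBuffer() {
    decodedFrames += Int64(CMSampleBufferGetNumSamples(buffer))
}
print(decodedFrames)  // should be the untrimmed frame count minus 2112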
However, the documentation says:
A value of nil for outputSettings configures the output to vend samples in their original format as stored by the specified track.
I don't think you can both discard packets [even the first one, which is basically a constant] and claim to be vending samples in their "original format", so from this point of view I think the change has a bug-like quality.
I also think it's an unfortunate change, as I used to consider nil-outputSettings AVAssetReader to be a sort of "raw" mode, but now it assumes your only use case is decoding to LPCM.
There's only one thing that could downgrade "unfortunate" to "serious bug", and that's if this new "let's pretend the first AAC packet doesn't exist" approach extends to files created with AVAssetWriter, because that would break interoperability with non-AVAssetReader code, where the number of frames to trim has congealed to a constant 2112 frames. I also haven't personally confirmed this. Do you have a file created with the above sample buffers that you can share?
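For reference, the trim that non-AVAssetReader code sees is the priming stored in the file's packet table, which you can dump with AudioToolbox. A sketch, assuming fileURL is your written mp4:

import AudioToolbox

var audioFile: AudioFileID?
guard AudioFileOpenURL(fileURL as CFURL, .readPermission, kAudioFileMPEG4Type, &audioFile) == noErr,
      let file = audioFile else { fatalError("could not open file") }
var info = AudioFilePacketTableInfo()
var size = UInt32(MemoryLayout<AudioFilePacketTableInfo>.size)
if AudioFileGetProperty(file, kAudioFilePropertyPacketTableInfo, &size, &info) == noErr {
    // mPrimingFrames should be the conventional 2112 if interoperability is intact.
    print("priming: \(info.mPrimingFrames), remainder: \(info.mRemainderFrames), valid: \(info.mNumberValidFrames)")
}
AudioFileClose(file)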
p.s. I don't think your input sample buffers are relevant here; I think you'd lose the first packet reading from any AAC file. However, your input sample buffers do seem slightly unusual in that they have hosttime [capture session?] style timestamps yet are AAC, and carry only one packet per sample buffer, which isn't very many and seems like a lot of overhead for 23 ms of audio. Are you creating them yourself in an AVCaptureSession -> AVAudioConverter chain?
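If so, a rough sketch of that kind of chain, purely as a guess at your setup (the PCM input here is just silence, not your capture data):

import AVFoundation

let pcmFormat = AVAudioFormat(standardFormatWithSampleRate: 44100, channels: 2)!
var aacDesc = AudioStreamBasicDescription(mSampleRate: 44100, mFormatID: kAudioFormatMPEG4AAC,
                                          mFormatFlags: 0, mBytesPerPacket: 0,
                                          mFramesPerPacket: 1024, mBytesPerFrame: 0,
                                          mChannelsPerFrame: 2, mBitsPerChannel: 0, mReserved: 0)
let aacFormat = AVAudioFormat(streamDescription: &aacDesc)!
let converter = AVAudioConverter(from: pcmFormat, to: aacFormat)!

let pcmBuffer = AVAudioPCMBuffer(pcmFormat: pcmFormat, frameCapacity: 1024)!
pcmBuffer.frameLength = 1024  // 1024 frames of silence

let aacBuffer = AVAudioCompressedBuffer(format: aacFormat, packetCapacity: 1,
                                        maximumPacketSize: converter.maximumOutputPacketSize)
var error: NSError?
let status = converter.convert(to: aacBuffer, error: &error) { _, outStatus in
    // Keep offering input until the converter has produced its one packet.
    outStatus.pointee = .haveData
    return pcmBuffer
}
print(status == .haveData)  // true once a packet has been produced

Each filled AVAudioCompressedBuffer here holds a single 1024-frame AAC packet, which would explain the one-packet sample buffers in your dumps.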