vb.net mp4 video-processing h.264 file-format

Offset from the start of the “mdat” box to the first frame

As a personal programming challenge, I have decided to write an MP4 decoder without using external libraries. To achieve this, I am using VB.NET with the .NET Framework 4.8.1 as a WinForms application, and I have purchased the documentation ISO 14496-12.

I have a function that reads the properties of an MP4 file (width, height, etc.), as well as the boxes that are important for extracting frames: Chunk offsets (stco), stsc box (First chunks, Samples per chunk, Sample description index), Sample Sizes, and stts box.

Then, in the second function, I use loops to iterate through these lists and create byte arrays with the correct data. However, I noticed that the first chunk position starts just after the beginning of “mdat”. Upon further inspection, I found that there is readable text (informational metadata) within the “mdat” box, which means it can't be image data or frame data.
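
For reference, the table walk described above can be sketched roughly as follows. This is only an illustration under assumptions: the names `SampleOffsetDemo`, `StscEntry` and `ComputeSampleOffsets` are made up, the tables are assumed to be already parsed into arrays, the stsc table is assumed to be non-empty, and the numbers in `Main` are invented rather than taken from a real file. It relies on the chunk offsets in stco being absolute file offsets and on the samples of a chunk being stored back to back.

```vbnet
Imports System
Imports System.Collections.Generic

Module SampleOffsetDemo
    ' One row of the stsc table (field names are illustrative).
    Structure StscEntry
        Public FirstChunk As Integer      ' 1-based index of the first chunk that uses this row
        Public SamplesPerChunk As Integer
    End Structure

    ' Returns the absolute file offset of every sample, in sample order.
    Function ComputeSampleOffsets(chunkOffsets As Long(), stsc As StscEntry(),
                                  sampleSizes As Integer()) As List(Of Long)
        Dim offsets As New List(Of Long)
        Dim sampleIndex As Integer = 0
        Dim row As Integer = 0
        For chunk As Integer = 1 To chunkOffsets.Length
            ' Advance to the stsc row that covers this chunk.
            While row + 1 < stsc.Length AndAlso stsc(row + 1).FirstChunk <= chunk
                row += 1
            End While
            Dim pos As Long = chunkOffsets(chunk - 1)
            For i As Integer = 1 To stsc(row).SamplesPerChunk
                If sampleIndex >= sampleSizes.Length Then Return offsets
                offsets.Add(pos)                 ' this sample starts here
                pos += sampleSizes(sampleIndex)  ' the next sample follows directly
                sampleIndex += 1
            Next
        Next
        Return offsets
    End Function

    Sub Main()
        Dim chunkOffsets As Long() = {48, 1000}
        Dim stsc As StscEntry() = {New StscEntry With {.FirstChunk = 1, .SamplesPerChunk = 2}}
        Dim sampleSizes As Integer() = {100, 50, 70, 30}
        Console.WriteLine(String.Join(", ", ComputeSampleOffsets(chunkOffsets, stsc, sampleSizes)))
        ' Prints: 48, 148, 1000, 1070
    End Sub
End Module
```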

This means I need some sort of offset from the start of the “mdat” box to the first frame in order to extract the frames.

To try and solve this issue, I attempted to find the start bytes of a frame (the 0, 0, 1 NAL unit start code, which I read about on the internet), but I have been unable to locate them; the start code is absent in multiple files. As mentioned, I have even purchased the documentation and searched it for various keywords, and I have googled for possible answers, but I have not yet found a solution.

These are the boxes that I'm parsing:
• ftyp
• mdat
• moov
and in moov:
• mvhd
• trak
• tkhd
• mdhd
• hdlr
• smhd
• stsd
• stts
• stsc
• stsz
• stco
According to the documentation, none of the other boxes are mandatory. I couldn't find a ‘saio’ box for auxiliary offsets or a ‘meta’ box.

What drives me nuts is that there are questions on Stack Overflow about decoding an MP4, but no one has this problem.

Any guidance or suggestions would be appreciated.

Edit 04.06.2023

Private Function Skip_text_metadata_and_find_first_frame() As Integer ' Thanks to VC.One, Stack Overflow; May 30, 2023.
    ' Returns the offset of the first keyframe NALU relative to Me.Mdat_Start_pos, or 0 if none is found.
    Dim tempPos As Integer = Me.Mdat_Start_pos
    tempPos += 4 ' skip the four "mdat" bytes to reach the first size field
    While (True)
        Dim tempNum As Integer = Get_lower_bits_of_a_byte(Me.Data(tempPos + 4), 5) ' NALU type
        If tempNum = 5 Then ' 101b is key frame
            Return (tempPos - Me.Mdat_Start_pos)
        End If

        ' Not a keyframe: read the big-endian size and skip to the next NALU.
        Dim size_NALU As UInteger = Me.Data(tempPos + 0) * 256UI * 256UI * 256UI +
                                    Me.Data(tempPos + 1) * 256UI * 256UI +
                                    Me.Data(tempPos + 2) * 256UI +
                                    Me.Data(tempPos + 3)
        tempPos += (CInt(size_NALU) + 4)
        If tempPos > (Me.Mdat_End - 4) Then
            Return 0
        End If
    End While

    Return 0
End Function

where

Private Shared Function Get_lower_bits_of_a_byte(value As Byte, bitNumber As Integer) As Integer
    ' Mask that keeps the lowest bitNumber bits (eg: bitNumber = 5 gives &H1F).
    Dim mask As Integer = (1 << bitNumber) - 1
    Return value And mask
End Function

Solution

The solution is to simply use its NALU size to skip that text metadata (actually called SEI data).
You will land on the next NALU, which might be a video frame (if not, keep skipping by size).
PS: There are no 0, 0, 0, 1 start codes when a NALU is inside an MP4 (each start code is replaced by a size integer).

Since you are learning MP4 bytes, I will add a more detailed summary...

For a video-only H.264 file, MDAT is essentially a collection of NAL units, each prefixed with four bytes giving its size (length).
(MP4 has a Size before each NAL, instead of the start code found in raw H.264 or MPEG-TS files.)
A typical MP4 layout:
[MDAT size in 4 bytes] --> [MDAT header in 4 bytes "m","d","a","t"] then the NALUs follow.
[NAL #1 size (4 bytes)] --> [NAL #1 data] --> [NAL #2 size (4 bytes)] --> [NAL #2 data].
Your first NAL unit starts after those four mdat text bytes.
Each NAL starts with [Size] (not a start code).
The fifth byte is the NAL unit's header (check the NAL type, eg: is it metadata or a video frame?).
You skip by reading the first four bytes for the NALU's Size at the start of each NAL.
Use the Size amount to move your byte checking position forward (eg: increment it by += Size + 4, where the +4 covers the size bytes themselves).
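
The layout described above can be sketched in VB.NET as a loop that hops from one size prefix to the next and records each NAL unit's type. Treat it as a sketch under assumptions: the names `NalWalkDemo`, `ReadUInt32BE` and `ListNalTypes` are made up for illustration, the size prefix is assumed to be four bytes long, and the `sample` bytes are hand-built rather than taken from a real file.

```vbnet
Imports System
Imports System.Collections.Generic

Module NalWalkDemo
    ' Reads a big-endian unsigned 32-bit integer starting at index pos.
    Function ReadUInt32BE(data As Byte(), pos As Integer) As UInteger
        Return (CUInt(data(pos)) << 24) Or
               (CUInt(data(pos + 1)) << 16) Or
               (CUInt(data(pos + 2)) << 8) Or
               CUInt(data(pos + 3))
    End Function

    ' Returns the NAL unit type of every length-prefixed unit found between
    ' startPos (inclusive) and endPos (exclusive).
    Function ListNalTypes(data As Byte(), startPos As Integer, endPos As Integer) As List(Of Integer)
        Dim types As New List(Of Integer)
        Dim pos As Long = startPos
        ' Each unit needs at least 4 size bytes plus 1 header byte.
        While pos + 5 <= endPos
            Dim size As UInteger = ReadUInt32BE(data, CInt(pos))
            ' Stop on an empty unit or one that would run past the end.
            If size = 0 OrElse pos + 4 + size > endPos Then Exit While
            types.Add(data(CInt(pos) + 4) And &H1F) ' low five bits = NAL unit type
            pos += 4 + size                         ' skip the size field and the payload
        End While
        Return types
    End Function

    Sub Main()
        ' "mdat" tag, then a 3-byte SEI unit (type 6), then a 2-byte keyframe unit (type 5).
        Dim sample As Byte() = {
            &H6D, &H64, &H61, &H74,
            0, 0, 0, 3, &H6, &HFF, &HFF,
            0, 0, 0, 2, &H65, &H88}
        Dim types As List(Of Integer) = ListNalTypes(sample, 4, sample.Length)
        Console.WriteLine(String.Join(", ", types)) ' Prints: 6, 5
    End Sub
End Module
```

The bounds checks are there so that a damaged or truncated size field ends the walk instead of reading outside the array.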

In summary, you have SEI as your first NAL unit, but if you skip by its size you should land on the next NALU, which would be a video frame (and, being the first frame, is expected to be a "keyframe").
Test on an MP4 file with no audio to reduce the variety of data you have to deal with during your practice.

Solution: (with pseudo-code as example)

Your text of mdat.....ÿÿ}ÜEé makes it hard to know the actual byte values.

Assuming your data has a layout like...

in text:  m  d  a  t  .  .  .  .  .  ÿ  ÿ  }  Ü  E  é
in  hex:  6D 64 61 74 AA BB CC DD XX FF FF 7D DC 45 E9

The structure of those bytes means...

6D 64 61 74 = four bytes of ASCII text as "mdat".
AA BB CC DD = A 32-bit integer for Size of NAL unit (eg: 00 00 02 72 or decimal 626).
XX = Is byte a 0x06 (decimal: 6)? If yes, that means this NAL unit's content is of type: SEI.
(a byte 0x65 or decimal value of 101 would mean this NAL is a video keyframe).
FF FF 7D DC 45 E9 ...etc = Beginning of SEI content (eg: ÿÿ}ÜEé ...etc).
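
To make the XX byte more concrete, here is a small sketch that splits an H.264 NAL unit header byte into its three bit fields (one forbidden_zero_bit, two bits of nal_ref_idc, five bits of nal_unit_type). The names `NalHeaderDemo` and `DescribeHeader` are made up for illustration, and 0x25 is just an extra example value showing that the type does not depend on the upper bits.

```vbnet
Imports System

Module NalHeaderDemo
    ' Splits the one-byte H.264 NAL unit header into its three fields and prints them.
    Sub DescribeHeader(header As Byte)
        Dim forbiddenZeroBit As Integer = (header >> 7) And &H1 ' top bit
        Dim nalRefIdc As Integer = (header >> 5) And &H3        ' next two bits
        Dim nalUnitType As Integer = header And &H1F            ' low five bits
        Console.WriteLine("0x{0:X2}: forbidden_zero_bit={1}, nal_ref_idc={2}, nal_unit_type={3}",
                          header, forbiddenZeroBit, nalRefIdc, nalUnitType)
    End Sub

    Sub Main()
        DescribeHeader(&H6)  ' 0x06: forbidden_zero_bit=0, nal_ref_idc=0, nal_unit_type=6
        DescribeHeader(&H65) ' 0x65: forbidden_zero_bit=0, nal_ref_idc=3, nal_unit_type=5
        DescribeHeader(&H25) ' 0x25: forbidden_zero_bit=0, nal_ref_idc=1, nal_unit_type=5
    End Sub
End Module
```

This is why masking with 0x1F is more robust than comparing the whole byte against 0x65: a keyframe slice can carry a different nal_ref_idc and still be type 5.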

The layout looks like you have a NAL unit of SEI metadata.
SEI (Supplemental Enhancement Information) is a form of side information that is useful to the decoder but not always needed. If the H264 is inside an MP4 (which itself has an "AVC Config" section), then SEI is usually not needed by the decoder/player. It can be safely removed in most MP4 files (some encoders just choose to add it in preparation for future use cases like fragmenting etc)...

The first four dots (AA BB CC DD) after the letters "mdat" represent four bytes (a 32-bit big-endian integer) that you must read in order to get the size of this NAL unit. You can use a built-in readInt function or, alternatively, combine the four separate byte values into a single integer.
The fifth byte (XX) is the NAL unit's header, which holds the NALU type. Read that byte's value into a variable, then check it as:
NAL_type = myValue & 0x1F; where myValue is the extracted MP4 byte value.
To skip this NAL unit, read the size integer into a var (eg: size_NALU) and advance your File/Array position past the four size bytes and the content (+= size_NALU + 4). That will move you onto the next NAL unit.
That next unit is expected to be a keyframe, so check its NAL type by ignoring its starting four size bytes and then getting that new fifth byte value (as myValue), to check the NAL type as type = myValue & 0x1F.
A simple trick is that usually, if myValue is decimal 101 (or hex "65"), then it is a keyframe.
If true: then you have found the first keyframe.
If false: then read its Size bytes and use them to skip to the next NAL, check that one's type, etc.
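
The two ways of reading the size mentioned above (a built-in conversion versus combining the bytes by hand) can be compared with a short sketch. The names `ReadSizeDemo`, `ReadSizeByShifting` and `ReadSizeWithBitConverter` are made up for illustration. `BitConverter` uses the byte order of the machine, so the four bytes are copied and reversed on little-endian systems before converting.

```vbnet
Imports System

Module ReadSizeDemo
    ' Combines four bytes into one big-endian unsigned 32-bit integer by shifting.
    Function ReadSizeByShifting(data As Byte(), pos As Integer) As UInteger
        Return (CUInt(data(pos)) << 24) Or
               (CUInt(data(pos + 1)) << 16) Or
               (CUInt(data(pos + 2)) << 8) Or
               CUInt(data(pos + 3))
    End Function

    ' Same result using BitConverter, which follows the machine's byte order.
    Function ReadSizeWithBitConverter(data As Byte(), pos As Integer) As UInteger
        Dim buf(3) As Byte ' four elements, indexes 0 to 3
        Array.Copy(data, pos, buf, 0, 4)
        If BitConverter.IsLittleEndian Then Array.Reverse(buf)
        Return BitConverter.ToUInt32(buf, 0)
    End Function

    Sub Main()
        ' The example size bytes 00 00 02 72 used in the explanation above.
        Dim sizeBytes As Byte() = {0, 0, &H2, &H72}
        Console.WriteLine(ReadSizeByShifting(sizeBytes, 0))       ' Prints: 626
        Console.WriteLine(ReadSizeWithBitConverter(sizeBytes, 0)) ' Prints: 626
    End Sub
End Module
```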

Example pseudo-code:

//# Vars setup
int myPos = 0; //# offset/position within MP4 bytes
int myNum = 0; //# holds temporary numeric values
int size_NALU = 0; //# size of NAL unit in bytes length.
int startPos_of_mdat = some_Num; //# use actual position for start of "mdat"

//# Vars temp numbers to create Integer from (bytes) Array values
int tempA = 0; int tempB = 0; int tempC = 0; int tempD = 0;

//# Main code
myPos = startPos_of_mdat; //# Is pos of the starting "m" letter/byte of "mdat" (myPos is already declared above)
myPos += 4; //# Move forward +4 bytes to reach the first NALU (ie: its first size byte)

while( true ) //# search by skipping according to Size, then check NALU type...
{
    myNum = ( MP4_Bytes[ myPos+4 ] & 0x1F ); //# extract the "NALU type" value
    
    if( myNum != 5 ) //# if not keyframe, then skip to next NALU...
    {
        tempA = MP4_Bytes[ myPos+0 ];
        tempB = MP4_Bytes[ myPos+1 ];
        tempC = MP4_Bytes[ myPos+2 ];
        tempD = MP4_Bytes[ myPos+3 ];
        
        //# concat into one 32-bit integer
        size_NALU = ( tempA << 24 | tempB << 16 | tempC << 8 | tempD );
        
        //# update to new position (is the new "skip to" point)
        myPos += (size_NALU + 4); //# must add +4 to account for the extra four bytes of SIZE's integer
        
        //# While loop will repeat until the ELSE is triggered
        //# Safety: stop if fewer than 5 bytes (4 size bytes + 1 header byte) remain in the data
        if( myPos + 5 > MP4_Bytes.Length ) { break; }
        
    }
    else
    {
        //# stop if keyframe is found
        Console.WriteLine( "## Found a Keyframe at offset: " + myPos );
        break;
    }
}