I am creating a fragmented MP4 (fMP4) with 2 tracks (one for video and one for audio). I am trying to find out how many video samples I should include in each mdat, and how many audio samples.
So my FMP4 has the following structure:
ftyp
moov
moof (track1 - video)
mdat (track1 - video)
moof (track2 - audio)
mdat (track2 - audio)
moof (track1 - video)
mdat (track1 - video)
moof (track2 - audio)
mdat (track2 - audio)
...
Should each video mdat contain just 1 frame or an entire GOP?
Should each audio mdat contain only the audio samples that correspond to the previous video mdat, or can I send as many audio samples as I want? Since audio samples are much smaller, I could send, say, 2 seconds of audio while the video mdat sent before it covers only 1 second.
PS: I thought about sending an entire GOP in each video mdat, but I noticed that when I re-encode that fMP4 with ffmpeg, the output has only 1 frame per mdat. I can do that (put just 1 video frame in each mdat), but then I am lost on how many audio samples I should send. If I send only the audio samples corresponding to that 1 video frame, the audio doesn't play very well.
Thanks!
From several empirical tests with ffmpeg, it appears to group about 0.5 s of video frames, followed by about 0.5 s of audio frames, and that layout appears to work really well.
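As a rough illustration of that grouping, here is a minimal Python sketch that plans which samples go into each moof/mdat pair. It is not ffmpeg's actual logic: the function names `split_by_time` and `plan_fragments`, the 0.5 s target, and the example sample durations are all assumptions made for this example. It also ignores keyframe positions, which a real muxer may want to take into account when cutting video fragments.

```python
from fractions import Fraction


def split_by_time(durations, timescale, target):
    """Group sample indices into buckets of `target` seconds.

    A sample belongs to bucket k when its start time falls in
    [k * target, (k + 1) * target). `durations` are in timescale ticks.
    """
    buckets = {}
    start = 0  # running start time of the current sample, in ticks
    for index, duration in enumerate(durations):
        k = int(Fraction(start, timescale) / target)
        buckets.setdefault(k, []).append(index)
        start += duration
    return buckets


def plan_fragments(video_durations, video_timescale,
                   audio_durations, audio_timescale,
                   target=Fraction(1, 2)):
    """Return the write order: one video fragment, then the audio
    fragment covering the same time window, and so on."""
    video = split_by_time(video_durations, video_timescale, target)
    audio = split_by_time(audio_durations, audio_timescale, target)
    plan = []
    for k in sorted(set(video) | set(audio)):
        if k in video:
            plan.append(("video", video[k]))
        if k in audio:
            plan.append(("audio", audio[k]))
    return plan


# Hypothetical input: 2 s of 25 fps video (timescale 25, 1 tick per frame)
# and 94 audio frames of 1024 samples each at 48 kHz (timescale 48000).
plan = plan_fragments([1] * 50, 25, [1024] * 94, 48000)
for track, samples in plan:
    print(track, "samples", samples[0], "to", samples[-1])
```

Because every sample is assigned by its start time rather than by a running count, the audio fragments stay aligned with the video fragments and do not drift, even though neither track divides evenly into 0.5 s.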