webm live-video vp9

How should frames be packed in a live WebM stream?


I'm encoding a live stream with VP9 via libvpx and want to stream it to an HTML5 player. I've read the Matroska specification and the W3C WebM Byte Stream Format, and examined a couple of WebM files generated by the vpxenc tool from libvpx. Everything seems fine, but I could not find any strict rules or guidelines on how to pack the encoded video frames inside the media segment described in the W3C specification.

As far as I understand, I have to emit media segments that contain Clusters with block elements inside. I can use a SimpleBlock element for each frame I get from the encoder, since each frame has a single timestamp. But how should the Clusters be organized? To me it makes sense to emit one Cluster per frame, containing a single SimpleBlock, to reduce buffering and lag. Is this approach considered normal, or does it have drawbacks, so that I should instead buffer for some time interval and then emit a Cluster containing multiple SimpleBlock elements covering the buffered period?

UPDATE

So I implemented the described approach (emitting Clusters with a single SimpleBlock each) and the video seems to lag a lot, so presumably this is not the way to go.


Solution

  • So I finally managed to mux a working live stream.

    It seems that the initial approach I described (one Cluster per frame, each holding a single SimpleBlock) does work as such, but it has several downsides:

    Key frames SHOULD be placed at the beginning of clusters

    One of my initial assumptions was that a Cluster cannot have an "unknown" size, but in practice it turned out that Chrome, VLC and ffplay are happy with one, so there is no need to buffer a full GOP to determine the size; the Cluster can be emitted on the fly.
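
    As an illustration, here is a minimal Python sketch of a Cluster header written with an unknown size. It assumes the standard Matroska element IDs (Cluster 0x1F43B675, Cluster timestamp 0xE7) and the 8-byte form of the EBML unknown-size marker. The helper names encode_uint and cluster_header are made up for this example and do not come from any library.

```python
CLUSTER_ID = b"\x1F\x43\xB6\x75"   # Matroska Cluster element ID
TIMECODE_ID = b"\xE7"              # Cluster timestamp (timecode) element ID
# EBML "unknown size": 8-byte length field with all data bits set to 1.
UNKNOWN_SIZE = b"\x01\xFF\xFF\xFF\xFF\xFF\xFF\xFF"


def encode_uint(value: int) -> bytes:
    """Encode an unsigned integer big-endian, using at least one byte."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def cluster_header(timecode_ticks: int) -> bytes:
    """Build a Cluster header that can be sent before any frame is known.

    The Cluster size is left unknown, so SimpleBlocks can simply be
    appended to the stream as the encoder produces them.
    """
    payload = encode_uint(timecode_ticks)
    # Timestamp payload is at most 8 bytes, so a 1-byte size field is enough.
    timecode = TIMECODE_ID + bytes([0x80 | len(payload)]) + payload
    return CLUSTER_ID + UNKNOWN_SIZE + timecode


if __name__ == "__main__":
    print(cluster_header(1000).hex())
```

    With an unknown size, the end of the Cluster is implied by the start of the next Cluster element, which is why nothing has to be buffered.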

    Another important aspect is that the timestamps in SimpleBlock elements are signed 16-bit integers, so a block can only encode an offset of up to 32767 ticks from the Cluster timecode. With the default timescale, where 1 tick is 1 ms, this means a Cluster cannot span more than about 32.7 seconds. If the GOP is very long, this criterion must also be taken into account when deciding whether to start a new Cluster.
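
    The two rules above (start a Cluster on every key frame, and never let the relative timestamp overflow) can be sketched in Python as follows. This is only an illustration under stated assumptions: timestamps are in milliseconds (the default timescale), the video is on a track number below 127, and the names needs_new_cluster, encode_size and simple_block are hypothetical helpers, not part of libvpx or any muxing library.

```python
import struct
from typing import Optional

SIMPLE_BLOCK_ID = b"\xA3"   # Matroska SimpleBlock element ID
MAX_REL_TICKS = 32767       # largest value of a signed 16-bit offset


def needs_new_cluster(pts_ms: int, cluster_start_ms: Optional[int],
                      is_keyframe: bool) -> bool:
    """Decide whether the frame at pts_ms must open a new Cluster."""
    if cluster_start_ms is None or is_keyframe:
        return True
    # The offset from the Cluster timecode must fit in a signed 16-bit field.
    return pts_ms - cluster_start_ms > MAX_REL_TICKS


def encode_size(size: int) -> bytes:
    """Encode an element size as an EBML variable-length integer."""
    for length in range(1, 9):
        # The all-ones pattern is reserved for "unknown size".
        if size < (1 << (7 * length)) - 1:
            return (size | (1 << (7 * length))).to_bytes(length, "big")
    raise ValueError("size too large for an EBML size field")


def simple_block(track: int, rel_ms: int, is_keyframe: bool,
                 frame: bytes) -> bytes:
    """Wrap one encoded frame in a SimpleBlock element."""
    if not 0 < track < 127:
        raise ValueError("this sketch only handles 1-byte track numbers")
    body = (
        bytes([0x80 | track])               # track number as a 1-byte vint
        + struct.pack(">h", rel_ms)         # signed 16-bit relative timestamp
        + bytes([0x80 if is_keyframe else 0x00])  # flags: bit 7 = key frame
        + frame
    )
    return SIMPLE_BLOCK_ID + encode_size(len(body)) + body


if __name__ == "__main__":
    print(needs_new_cluster(40000, 0, False))
    print(simple_block(1, 33, False, b"\x01\x02").hex())
```

    In a muxing loop, each frame would first go through needs_new_cluster; if it returns True, a new Cluster header is written with the frame's timestamp as the Cluster timecode, and the frame is then written as a SimpleBlock whose relative timestamp is the frame's timestamp minus that timecode.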

    Finally, here is a link to a live stream (The "Big Buck Bunny" trailer, but in a live format) that seems to work with all the players and is generated as per the description above.

    Hope this information helps anyone.