Tags: c++, gstreamer, v4l2, nvidia-jetson, h.265

How to reduce GStreamer latency?


I wrote a pipeline that grabs a 720 x 576 image from a 1920 x 576 sensor with the v4l2src element on an NVIDIA Jetson Xavier NX. The pipeline grabs the frame and then does two things:

  1. pushes the frame to the appsink element
  2. encodes it and streams it with udpsink to the client

The pipeline is as follows:

gst-launch-1.0 v4l2src device=/dev/video0 ! 
queue max-size-time=1000000 ! 
videoconvert n-threads=8 ! 
video/x-raw,format=I420,width=1920,height=576 ! 
videoscale n-threads=8 method=0 ! 
video/x-raw,format=I420,width=720,height=576 ! 
tee name=t  ! queue ! 
valve name=appValve drop=false ! 
appsink name=app_sink t. ! 
queue max-size-time=1000000 ! 
videorate max-rate=25 ! 
nvvidconv name=nvconv ! 
capsfilter name=second_caps ! 
nvv4l2h265enc control-rate=1 iframeinterval=256 bitrate=1615000 peak-bitrate=1938000 preset-level=1 idrinterval=256 vbv-size=64600 maxperf-enable=true ! 
video/x-h265 ! 
h265parse config-interval=1 ! 
tee name=t2 ! 
queue max-size-time=1000000 ! 
valve name=streamValve drop=false ! 
udpsink host=192.168.10.56 port=5000 sync=false name=udpSinkElement

My question is:

Is there any way to reduce the latency of this pipeline?

I tried to reduce the latency by adding more queues and setting n-threads on videoscale and videoconvert, but it didn't help.


Solution

  • How do you measure latency?

    If it's the time it takes to see the video after launching the client, then you'd probably need to reduce the GOP length (iframeinterval), because the client has to wait for an I-frame (or several I-slices) before it can reconstruct a complete picture.
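
    For example, here is a minimal sketch of the encoder settings with a one-second GOP at 25 fps (bitrate values kept from your pipeline, adjust to taste):

    ... ! nvv4l2h265enc control-rate=1 bitrate=1615000 peak-bitrate=1938000 preset-level=1 maxperf-enable=true iframeinterval=25 idrinterval=25 ! video/x-h265 ! h265parse config-interval=1 ! ...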

    You can easily check whether this is the case by displaying the decoded video stream on a monitor while pointing your video source at a timer. Take a picture of both (monitor + timer) with your phone and you have a pretty good measure of glass-to-glass latency.

    Launch the stream multiple times and measure the latency each time. Why multiple times? Because the delay varies depending on where you are in the GOP (close to or far from the next I-frame) when the client connects. Typically, with a 25 fps source, a GOP of 256 frames means you could see anywhere from 0.04 s to about 10.2 s of delay before decoding can start.
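
    As an illustration, here is a minimal receiving pipeline you could use for this measurement. It assumes you keep sending the raw H.265 byte-stream over UDP as in your pipeline and that avdec_h265 (gst-libav) is available on the client; swap in your platform's hardware decoder if you have one:

    gst-launch-1.0 udpsrc port=5000 caps="video/x-h265,stream-format=byte-stream" ! 
    h265parse ! 
    avdec_h265 ! 
    videoconvert ! 
    autovideosink sync=false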

    Also, your pipeline is more complex than it needs to be. You can use nvvidconv (GPU), which is much more efficient than videoscale (CPU), to rescale your video. You can set the framerate directly through capabilities (no need for videorate). You can also limit the UDP sink's (and source's) buffer size, trading latency against resilience to reordered and late packets. A simplified sketch follows below.
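
    Something along these lines, with the appsink/tee branch left out for brevity. This is an untested sketch: it assumes your sensor can deliver 25 fps, that nvvidconv accepts your camera's pixel format, and that the NVMM caps string and buffer-size value are right for your setup, so verify on your L4T release:

    gst-launch-1.0 v4l2src device=/dev/video0 ! 
    video/x-raw,width=1920,height=576,framerate=25/1 ! 
    nvvidconv ! 
    'video/x-raw(memory:NVMM),format=I420,width=720,height=576' ! 
    nvv4l2h265enc control-rate=1 bitrate=1615000 peak-bitrate=1938000 preset-level=1 maxperf-enable=true iframeinterval=25 idrinterval=25 ! 
    h265parse config-interval=1 ! 
    udpsink host=192.168.10.56 port=5000 sync=false buffer-size=65536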

    There are other tricks to reduce latency, but you'll lose something else in exchange. You can ask the encoder to shorten the GOP, disable B-frames, enable slice-level encoding, reduce the slice length, increase or decrease the slice intra-refresh interval, limit the profile, etc. All of them have drawbacks compared to the default settings, YMMV.
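
    The exact property names for those knobs vary between L4T / GStreamer releases, so check what your encoder actually exposes before relying on any of them:

    gst-inspect-1.0 nvv4l2h265enc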

    Adding queues usually increases latency (while relieving CPU hotspots) rather than reducing it, unless your queues are almost always empty, in which case you don't need them. That's because a queue requires synchronizing threads, and that takes time. Queues are only required when you have parallel processing on the data and the different branches aren't running at the same speed. In the "simple" sequential grab, encode, stream mode, a queue is usually not required (since most steps can run on the GPU & NVENC and are not CPU limited). If you need synchronous I/O (like a filesink), then a queue can be beneficial, if and only if the processing time is sometimes longer than the grabber's sample interval.
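
    If you do keep the tee and its branches (and therefore the queues), one way to stop them from accumulating latency is to make them tiny and leaky, so stale buffers are dropped instead of piling up. Whether that's acceptable depends on your use case, since it trades buffering for dropped frames; as a sketch, applied to your appsink branch:

    ... ! tee name=t ! queue max-size-buffers=1 leaky=downstream ! valve name=appValve drop=false ! appsink name=app_sink ...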