python · opencv · webcam · video-capture · low-latency

Lower latency from webcam cv2.VideoCapture


I'm building an application using the webcam to control video games (kinda like a kinect). It uses the webcam (cv2.VideoCapture(0)), AI pose estimation (mediapipe), and custom logic to pipe inputs into dolphin emulator.

The issue is the latency. I've used my phone's high-speed camera to record myself snapping my fingers, and found a latency of around 32 frames (~133 ms) between my hand and the frame on screen. This is before any additional processing: just a loop with a video read and cv2.imshow (which itself takes about 15 ms).

Is there any way to decrease this latency?

I'm already grabbing the frame in a separate Thread, setting CAP_PROP_BUFFERSIZE to 0, and lowering the CAP_PROP_FRAME_HEIGHT and CAP_PROP_FRAME_WIDTH, but I still get ~133ms of latency. Is there anything else I can be doing?

Here's my code below:

import cv2
from threading import Thread, Condition


class WebcamStream:
    def __init__(self, src=0):
        self.stopped = False

        self.stream = cv2.VideoCapture(src)
        self.stream.set(cv2.CAP_PROP_BUFFERSIZE, 0)
        self.stream.set(cv2.CAP_PROP_FRAME_HEIGHT, 400)
        self.stream.set(cv2.CAP_PROP_FRAME_WIDTH, 600)

        (self.grabbed, self.frame) = self.stream.read()

        self.hasNew = self.grabbed
        self.condition = Condition()

    def start(self):

        Thread(target=self.update, args=()).start()
        return self

    def update(self):
        while True:
            if self.stopped:
                return

            (self.grabbed, self.frame) = self.stream.read()
            with self.condition:
                self.hasNew = True
                self.condition.notify_all()

    def read(self):
        # Hold the lock for the flag check as well, otherwise a notify
        # between the check and the wait() can be missed
        with self.condition:
            while not self.hasNew:
                self.condition.wait()
            self.hasNew = False
            return self.frame

    def stop(self):
        self.stopped = True

The application needs to run in as close to real time as possible, so any reduction in latency, no matter how small would be great. Currently between the webcam latency (~133ms), pose estimation and logic (~25ms), and actual time it takes to move into the correct pose, it racks up to about 350-400ms of latency. Definitely not ideal when I'm trying to play a game.

EDIT: Here's the code I used to test the latency (running the code on my laptop, recording both my hand and the screen with the high-speed camera, and counting the frame difference at the snap):

if __name__ == "__main__":
    cap = WebcamStream().start()
    while True:
        frame = cap.read()
        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.stop()

Solution

  • Welcome to the War-on-Latency ( shaving it off )

    The experience you describe above is a vivid example of how accumulated latencies can devastate any chance of keeping a control-loop tight enough to indeed control something in a meaningfully stable fashion, as in the MAN-to-MACHINE-INTERFACE system we wish to keep:

    User's-motion | CAM-capture | IMG-processing | GUI-show | User's-visual-cortex-scene-capture | User's decision+action | loop

    A real-world situation, in which OpenCV profiling was used to "sense" how much time we spend in each actual phase of the acquisition-storage-transformation-postprocessing-GUI pipeline ( zoom in as needed )

    [ profiling screenshot: per-phase timing of the OpenCV pipeline ]


    What latency-causing steps do we work with?

    Forgive, for a moment, a raw sketch of where each of the particular latency-related costs accumulates:

    
       CAM \____/                                     python code GIL-awaiting ~ 100 [ms] chopping
            |::|                                      python code calling a cv2.<function>()
            |::|   __________________________________________-----!!!!!!!-----------
            |::|    ^     2x                                 NNNNN!!!!!!!MOVES DATA!
            |::|    | per-call                               NNNNN!!!!!!!    1.THERE
            |::|    |     COST                               NNNNN!!!!!!!    2.BACK
            |::|    |           TTTT-openCV::MAT into python numpy.array
            |::|    |          ////       forMAT TRANSFORMER TRANSFORMATIONS
            USBx    |         ////                           TRANSFORMATIONS
            |::|    |        ////                            TRANSFORMATIONS
            |::|    |       ////                             TRANSFORMATIONS
            |::|    |      ////                              TRANSFORMATIONS
            |::|    |     ////                               TRANSFORMATIONS
        H/W oooo   _v____TTTT in-RAM openCV::MAT storage     TRANSFORMATIONS
           /    \        oooo ------ openCV::MAT object-mapper
           \    /        xxxx
     O/S--- °°°°         xxxx
     driver """" _____   xxxx
             \\\\    ^   xxxx ...... openCV {signed|unsigned}-{size}-{N-channels}
     _________\\\\___|___++++ __________________________________________
     openCV I/O      ^   PPPP                                 PROCESSING
                as F |   ....                                 PROCESSING
                   A |   ...                                  PROCESSING
                   S |   ..                                   PROCESSING
                   T |   .                                    PROCESSING
                as   |   PPPP                                 PROCESSING
          possible___v___PPPP _____ openCV::MAT NATIVE-object PROCESSING
    
    

    What latencies do we / can we fight ( here ) against?

    Hardware changes could help, yet replacing already acquired hardware can turn expensive

    Shaving software latencies off already latency-optimised toolboxes is possible, yet harder & harder

    Design inefficiencies are the final & most common place where latencies can get shaved off


    OpenCV ?
    There is not much to do here. The problem is with the OpenCV-Python binding details:

    ... So when you call a function, say res = equalizeHist(img1,img2) in Python, you pass two numpy arrays and you expect another numpy array as the output. So these numpy arrays are converted to cv::Mat and then calls the equalizeHist() function in C++. Final result, res will be converted back into a Numpy array. So in short, almost all operations are done in C++ which gives us almost same speed as that of C++.

    This works fine "outside" a control-loop, but not in our case, where the two transport costs, the transformation costs and any RAM-allocation costs for new or interim data storage all worsen our control-loop TAT ( turn-around time ).

    So avoid any and all calls of OpenCV-native functions from the Python-( behind the bindings' latency extra-miles )-side inside the control-loop, no matter how tempting or sweet they may look at first sight.

    HUNDREDS-of-[ms] are a rather bitter cost of ignoring this advice.


    Python ?
    Yes, Python. Using the Python interpreter introduces latency per se, plus it blocks truly concurrent processing, no matter how many cores our hardware operates on ( while recent Py3 tries a lot to lower these costs below the interpreter-level software ).

    We can test & squeeze the max out of the ( still unavoidable, in 2022 ) GIL-lock interleaving - check sys.getswitchinterval() and test increasing this amount for having less interleaved python-side processing ( the tweaking depends on your python-application's other ambitions: GUI, tasks, python network-I/O workloads, python-HW-I/O-s, if applicable, etc. )
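A minimal sketch of that check-and-tweak, standard library only (the 0.05 s value is purely illustrative; the right value depends on your workload mix):

```python
import sys

# CPython's default GIL switch interval is 0.005 s ( 5 ms )
default_interval = sys.getswitchinterval()

# let the capture thread run longer before being pre-empted;
# 0.05 s is only an illustrative starting point to experiment from
sys.setswitchinterval(0.05)
assert abs(sys.getswitchinterval() - 0.05) < 1e-6

# restore the default once done experimenting
sys.setswitchinterval(default_interval)
```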


    RAM-memory-I/O costs ?
    Our next major enemy. Using the least-sufficient image-DATA-format that MediaPipe can work with is the way forward in this segment.


    Avoidable losses
    All other ( our ) sins belong to this segment. Avoid any image-DATA-format transformations ( see above; the cost may easily grow into HUNDREDS of THOUSANDS of [us], just for converting an already acquired-&-formatted-&-stored numpy.array into just another colourmap )

    MediaPipe
    lists the enumerated formats it can work with:

     // ImageFormat
    
      SRGB: sRGB, interleaved:   one byte for R,
                            then one byte for G,
                            then one byte for B for each pixel.
    
      SRGBA: sRGBA, interleaved: one byte for R,
                                 one byte for G,
                                 one byte for B,
                                 one byte for alpha or unused.
    
      SBGRA: sBGRA, interleaved: one byte for B,
                                 one byte for G,
                                 one byte for R,
                                 one byte for alpha or unused.
    
      GRAY8:        Grayscale,   one byte per pixel.
    
      GRAY16:       Grayscale,   one uint16 per pixel.
    
      SRGB48:  sRGB,interleaved, each component is a uint16.
    
      SRGBA64: sRGBA,interleaved,each component is a uint16.
    
      VEC32F1:                   One float per pixel.
    
      VEC32F2:                   Two floats per pixel.
    

    So, choose the MVF -- the minimum viable format -- for the gesture-recognition to work, and downscale the amount of pixels as much as possible ( 400x600-GRAY8 would be my hot candidate )

    Pre-configure ( not missing the cv.CAP_PROP_FOURCC details ) the native-side OpenCV::VideoCapture processing to do no more than just plainly store this MVF in a RAW format on the native side of the Acquisition-&-Pre-processing chain, so that no other post-processing formatting takes place.

    If indeed forced to ever touch the python-side numpy.array object, prefer vectorised & striding-tricks powered operations working on .view()-s or .data-buffers, so as to avoid any unwanted add-on latency costs increasing the control-loop TAT.
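A tiny numpy-only illustration of the view-versus-copy distinction hinted at above:

```python
import numpy as np

frame = np.zeros((400, 600, 3), dtype=np.uint8)   # stand-in for a captured BGR frame

# zero-copy: plain slicing yields a strided view into the very same buffer
green = frame[:, :, 1]
assert green.base is frame                        # no pixel data were moved

# data-moving: .copy() / .astype() / fancy-indexing allocate & move bytes
green_copy = frame[:, :, 1].copy()
assert green_copy.base is None

# writes through the view are visible in the original frame
green[:] = 255
assert frame[0, 0, 1] == 255 and frame[0, 0, 0] == 0
```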


    Options?