python, opencv, image-processing, video-processing

Shaky zoom with OpenCV in Python


I want to apply a zoom-in/zoom-out effect to a video using OpenCV. Since OpenCV has no built-in zoom, I try cropping each frame to the interpolated width, height, x and y, and then resizing the crop back to the original video size, i.e. 1920 x 1080.

But when I render the final video, it is shaky. I'm not sure why this happens; I want a perfectly smooth zoom in and out starting at a specific time.

I built an easing function that gives interpolated values for each frame of the zoom in and out:

import cv2


video_path = 'inputTest.mp4'
cap = cv2.VideoCapture(video_path)

fps = int(cap.get(cv2.CAP_PROP_FPS)) 
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output_video.mp4', fourcc, fps, (1920, 1080))

initialZoomValue = {
    'initialZoomWidth': 1920,
    'initialZoomHeight': 1080,
    'initialZoomX': 0,
    'initialZoomY': 0
}

desiredValues = {
    'zoomWidth': 1672,
    'zoomHeight': 941,
    'zoomX': 200,
    'zoomY': 0
}

def ease_out_quart(t):
    return 1 - (1 - t) ** 4

def zoomInInterpolation(initialZoomValue, desiredZoom, start, end, index):
    t = (index - start) / (end - start)
    eased_t = ease_out_quart(t)

    interpolatedWidth = round(initialZoomValue['initialZoomWidth'] + eased_t * (desiredZoom['zoomWidth'] - initialZoomValue['initialZoomWidth']), 2)
    interpolatedHeight = round(initialZoomValue['initialZoomHeight'] + eased_t * (desiredZoom['zoomHeight'] - initialZoomValue['initialZoomHeight']), 2)
    interpolatedX = round(initialZoomValue['initialZoomX'] + eased_t * (desiredZoom['zoomX'] - initialZoomValue['initialZoomX']), 2)
    interpolatedY = round(initialZoomValue['initialZoomY'] + eased_t * (desiredZoom['zoomY'] - initialZoomValue['initialZoomY']), 2)

    # crop coordinates must be whole pixels, so truncate to int
    return {'interpolatedWidth': int(interpolatedWidth), 'interpolatedHeight': int(interpolatedHeight), 'interpolatedX': int(interpolatedX), 'interpolatedY': int(interpolatedY)}

def generate_frame():
    while cap.isOpened():
        ok, frame = cap.read()
        if ok:
            yield frame
        else:
            break


for i, frame in enumerate(generate_frame()):
    if 1 <= i <= 60:
        interpolatedValues = zoomInInterpolation(initialZoomValue, desiredValues, 1, 60, i)
        crop = frame[interpolatedValues['interpolatedY']:(interpolatedValues['interpolatedY'] + interpolatedValues['interpolatedHeight']), interpolatedValues['interpolatedX']:(interpolatedValues['interpolatedX'] + interpolatedValues['interpolatedWidth'])]
        zoomedFrame = cv2.resize(crop, (1920, 1080), interpolation=cv2.INTER_CUBIC)

        out.write(zoomedFrame)

# Release the capture and writer, and close windows
cap.release()
out.release()
cv2.destroyAllWindows()

But the final video I get is shaky:

Final Video

I want the video to zoom in and out perfectly smoothly, without any shakiness.


Here is a graph of the interpolated values: GRAPH

And here is the graph if I don't round the numbers too early and only return integer values: GRAPH for Integer return

Since OpenCV only accepts whole numbers for cropping, the interpolation function cannot return decimal values.


Solution

  • First, let's look at why your approach jitters. Then I'll show you an alternative that doesn't jitter.

    In your approach, you zoom by first cropping the image and then resizing it. That crop can only change in whole-pixel steps, not finer ones. You see this especially near the end of the ease, where the zoom changes very slowly: the cropped region's width/height changes by less than a pixel per frame, so it only actually changes every couple of frames. The jerkiness also worsens the further you zoom in, because each source pixel then covers more of the output.
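
    You can see this quantization directly by printing the integer crop width for consecutive frames near the end of the ease (a small sketch reusing your easing function and start/end values):

    def ease_out_quart(t):
        return 1 - (1 - t) ** 4

    # width eases from 1920 down to 1672 over frames 1..60
    for i in range(50, 61):
        t = (i - 1) / (60 - 1)
        w = 1920 + ease_out_quart(t) * (1672 - 1920)
        print(i, int(w))  # the integer width is identical for many frames in a row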

    Instead of cropping like this, calculate and apply a transform matrix for every frame, using warpAffine() or warpPerspective(). Treating the source image as a texture, these functions run over every destination pixel, use the transform matrix to calculate the corresponding point in the source image, and then sample the source there with the chosen interpolation mode.
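
    For example, the crop-and-resize you're doing can be expressed as a single warpAffine() call with sub-pixel precision (a minimal sketch; x, y, w, h stand for your un-rounded float crop values, and out_size for the output resolution):

    import cv2 as cv
    import numpy as np

    def subpixel_crop(frame, x, y, w, h, out_size=(1920, 1080)):
        """Map the float rectangle (x, y, w, h) onto out_size without rounding."""
        W, H = out_size
        sx, sy = W / w, H / h
        # forward (src -> dst) transform: shift the rect's corner to the origin, then scale
        M = np.array([[sx, 0, -x * sx],
                      [0, sy, -y * sy]], dtype=np.float64)
        return cv.warpAffine(frame, M, (W, H), flags=cv.INTER_CUBIC)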

    The compound transform is calculated from three primitive transforms:

    1. move a particular point (the "anchor") of the image to the origin
    2. scale around origin (the anchor)
    3. move to where it should be in the video frame

    In the code, I build this transform in this sequence. You could also write it as a single expression: T = translate2(*+zoom_center) @ scale2(s=z) @ translate2(*-anchor). Note that the operations expressed by these matrices are applied from right to left.

    import numpy as np
    import cv2 as cv
    from tqdm import tqdm # remove that if you don't like it
    
    # Those two functions generate simple translation and scaling matrices:
    
    def translate2(tx=0, ty=0):
        T = np.eye(3)
        T[0:2, 2] = [tx, ty]
        return T
    
    def scale2(s=1, sx=1, sy=1):
        T = np.diag([s*sx, s*sy, 1])
        return T
    
    # you know this one already
    
    def ease_out_quart(alpha):
        return 1 - (1 - alpha) ** 4
    
    # some constants to describe the zoom
    
    im = cv.imread(cv.samples.findFile("starry_night.jpg"))
    (imheight, imwidth) = im.shape[:2]
    
    (output_width, output_height) = (1280, 720)
    
    fps = 60
    duration = 5.0 # secs
    
    # "anchor": somewhere in the image
    anchor = np.array([ (imwidth-1) * 0.75, (imheight-1) * 0.75 ])
    # position: somewhere in the frame
    zoom_center = np.array([ (output_width-1) * 0.75, (output_height-1) * 0.75 ])
    
    zoom_t_start, zoom_t_end = 1.0, 4.0
    zoom_z_start, zoom_z_end = 1.0, 10.0
    
    # calculates the matrix:
    
    def calculate_transform(timestamp):
        alpha = (timestamp - zoom_t_start) / (zoom_t_end - zoom_t_start)
        alpha = np.clip(alpha, 0, 1)
        alpha = ease_out_quart(alpha)
        z = zoom_z_start + alpha * (zoom_z_end - zoom_z_start)
    
        T = translate2(*-anchor)
        T = scale2(s=z) @ T
        T = translate2(*+zoom_center) @ T
    
        return T
    
    # applies the matrix:
    
    def animation_callback(timestamp, canvas):
        T = calculate_transform(timestamp)
        cv.warpPerspective(
            src=im,
            M=T,
            dsize=(output_width, output_height),
            dst=canvas, # drawing over the same buffer repeatedly
            flags=cv.INTER_LANCZOS4, # or INTER_LINEAR, INTER_NEAREST, ...
        )
    
    # generate the video
    
    writer = cv.VideoWriter(
        filename="output.avi",  # AVI container: OpenCV built-in
        fourcc=cv.VideoWriter_fourcc(*"MJPG"), # MJPEG codec: OpenCV built-in
        fps=fps,
        frameSize=(output_width, output_height),
        isColor=True
    )
    assert writer.isOpened()
    
    canvas = np.zeros((output_height, output_width, 3), dtype=np.uint8)
    
    timestamps = np.arange(0, duration * fps) / fps
    
    try:
        for timestamp in tqdm(timestamps):
            animation_callback(timestamp, canvas)
    
            writer.write(canvas)
    
            cv.imshow("frame", canvas)
    
            key = cv.waitKey(1)
            if key in (13, 27): break
    
    finally:
        cv.destroyWindow("frame")
        writer.release()
        print("done")
    

    Here are several result videos.

    https://imgur.com/a/mDfrpre

    One of them is a very close zoom with nearest-neighbor interpolation, so you can see the pixels clearly. As you can see, the image isn't "cropped" to whole pixels: there are fractions of pixels at the edges.

    For fun I also made one video with multiple animation segments (pause, in, pause, out, pause). That merely needs some logic in calculate_transform to figure out which time segment you're in:

    zoom_keyframes = [ # (time, zoom)
        (0.0, 15.0),
        (1.0, 15.0),
        (2.0, 16.0),
        (3.0, 16.0),
        (4.0, 15.0),
        (5.0, 15.0),
    ]
    
    def calculate_transform(timestamp):
        i0 = i1 = 0
        for i, (tq, _) in enumerate(zoom_keyframes):
            if tq <= timestamp:
                i0 = i
            if timestamp <= tq:
                i1 = i
                break
    
        if i1 == i0: i1 = i0 + 1
    
        # print(f"i0 {i0}, i1 {i1}")
    
        zoom_ta, zoom_za = zoom_keyframes[i0]
        zoom_tb, zoom_zb = zoom_keyframes[i1]
    
        alpha = (timestamp - zoom_ta) / (zoom_tb - zoom_ta)
        alpha = np.clip(alpha, 0, 1)
        alpha = ease_out_quart(alpha)
        z = zoom_za + alpha * (zoom_zb - zoom_za)
    
        T = translate2(*-anchor)
        T = scale2(s=z) @ T
        T = translate2(*+zoom_center) @ T
    
        return T
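
    To verify the segment logic, you can sample the zoom factor across the timeline; the scale factor sits at T[0, 0] because the translation matrices have ones on the diagonal:

    for t in np.arange(0.0, 5.0, 0.5):
        z = calculate_transform(t)[0, 0]  # scale component of the compound matrix
        print(f"t={t:.1f}s  z={z:.2f}")
    # holds at 15, eases up to 16, holds, eases back down to 15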