Tags: python, numpy, opencv, gaussianblur

Can't seem to recreate Gaussian blur using Numpy


I want to implement Gaussian Blur (similar to cv2.GaussianBlur) using only Numpy but I can't seem to understand how to tackle the alpha channel. My code:

# 1D Gaussian PD Function
def gauss(x, sigma):
    g= (1/(((2*np.pi)**0.5)*sigma))* np.exp((-x**2)/(2*sigma**2))
    return g
def gaussian_filter(image, sigma, padding=True):
    
    ks= 2*(3*sigma)+1    #Height/ Width of Kernel. Using a Rule of Thumb explained in one of the books.
    image_result= np.zeros((image.shape[0], image.shape[1], image.shape[2]))
    
    if padding:   #Using the appropriate padding to preserve dimensions
        pad= int((ks-1)/2)
        image= np.pad(image, ((pad,pad),(pad,pad),(0,0)))
           
    k= np.arange(-(ks-1)/2, ks/2)   # A simple array of size ks centred around 0 
    kernel= gauss(k, sigma)         # The corresponding Gaussian PD values with mean 0.
    kernel= kernel[:, np.newaxis]   
    kernel= kernel@kernel.T         #Using property of 2D Gaussian i.e product of two 1D Gaussians 
    kernel=kernel[:,:, np.newaxis]  #Adding another dimension to match the number of channels of img
    
    h= kernel
    for i in range(image.shape[2]-1):   #Duplicating the kernel for every channel
        kernel= np.concatenate((kernel, h), axis=2)
           
    for i in range(image_result.shape[0]):    #Convolution step
        for j in range(image_result.shape[1]):
            temp= image[i:i+ks, j:j+ks, :]
            temp= temp*kernel
            image_result[i,j] = (1/(ks*ks))*np.sum(np.sum(temp, axis= 0), axis=0)
    '''    
    ad= np.ones((image_result.shape[0], image_result.shape[1],1))     # Not sure about alpha channel. Tried setting it to maximum value.
    image_result= np.concatenate((image_result[:,:,0:3],ad), axis=2)
    '''
    return image_result

I read through the OpenCV docs and Wikipedia to come up with this implementation. But as I increase the standard deviation, the image gets darker. Also, the result is not in line with the one produced by cv2.GaussianBlur, which seems to preserve the alpha channel rather than convolve it. My results:

Any help is deeply appreciated :D


Solution

  • Double normalization of the kernel

    You shouldn't divide the sum by ks². That is something we do when the kernel is full of ones (kernel=np.ones((ks,ks))). There, since the kernel sum is ks², multiplying by the kernel and summing multiplies the average value of the image by ks², so we do need to divide. Said otherwise, np.ones((ks,ks)) isn't normalized; np.ones((ks,ks))/ks/ks is. But your gaussian is already (or almost so) normalized: kernel.sum() is already 1 (or almost so). See the quick check below.
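    A minimal check of that claim, reusing the gauss helper from the question (σ=4 here is just an example):

    sigma = 4
    ks = 2*(3*sigma) + 1                # 25
    k = np.arange(-(ks-1)/2, ks/2)
    kernel = gauss(k, sigma)[:, None] @ gauss(k, sigma)[None, :]

    print(np.ones((ks, ks)).sum())      # 625.0 -> a box kernel does need the /ks²
    print(kernel.sum())                 # ≈ 0.995 -> the gaussian kernel does not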

    More accurate way to normalize it

    Even without that /ks/ks, the kernel sum is not perfectly 1. Nobody would have noticed if the /ks/ks weren't there, but while we are at it, let's do it right.

    Concerning my two "or almost so": gaussians are infinite, but you can't have infinite kernels, of course. So, usually, we keep just the middle part of the gaussian. Hence the 3*sigma rule of thumb you refer to in a comment: the cumulative distribution function of a gaussian is such that 99.7% of its total mass lies within [μ-3σ, μ+3σ]. But that also means the total value of the kernel is not 1, but about 0.997. Since the kernel computation is done once and for all, not in a loop, I would make its sum exactly 1, by computing

    # Non-normalized kernel: no need to divide by √(2π)σ, since we normalize just after
    kernel = np.exp((-k[:,None]**2)/(2*sigma**2)) * np.exp((-k[None,:]**2)/(2*sigma**2))
    kernel /= kernel.sum()  # now kernel.sum() is exactly 1
    

    Wrong dtype when displaying

    Because you divide by ks², which for σ=4 is 625 (ks=25), the image is 625 times too dark. But because of another error, the fact that you return an image whose pixels are floats, it is also shown 255 times too bright: when an image's dtype is a byte, pixels are expected to range from 0 to 255, but when it is a float, they are expected to range from 0 to 1. So: 625 times too dark because of the /ks/ks, times 255 times too bright because of the float dtype, equals 625/255 ≈ 2.45 times too dark overall. Which is quite consistent with what we see.

    So, the point is, you should return not image_result but image_result.astype(np.uint8). Or, if you want to keep floats, divide by 255.0. For example:
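    A minimal sketch of both options; the np.clip is my addition, to guard against any rounding overshoot before the cast:

    # Option 1: convert back to bytes, the 0-255 range expected for uint8 images
    blurred_u8 = np.clip(image_result, 0, 255).astype(np.uint8)

    # Option 2: keep floats, but scale to the 0-1 range expected for float images
    blurred_f = image_result / 255.0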

    Broadcasting

    You don't really need to duplicate kernel for all channels. You already added an axis. You could keep it as it is (that is as size 1 axis when you create it). Broadcasting would have given the same exact result.

    (I mean image[i:i+ks, j:j+ks,:]*kernel would give the same result whether kernel is a ks×ks×3 array with 3 identical ks×ks planes, or a ks×ks×1 array.)
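    A quick demonstration of that equivalence, on a hypothetical 5×5×3 patch:

    rng = np.random.default_rng(0)
    window = rng.random((5, 5, 3))                   # a ks×ks×3 image patch, ks=5
    k2d = rng.random((5, 5))

    tiled = np.repeat(k2d[:, :, None], 3, axis=2)    # ks×ks×3, three identical planes
    thin = k2d[:, :, None]                           # ks×ks×1, broadcast over channels

    print(np.allclose(window * tiled, window * thin))   # True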

    Vectorization

    Of course, I take it this is a pedagogical attempt, to check that you can reproduce cv2's blur, so performance is probably not the main issue.

    Still, there are some easy steps.

    For example, there is no need to chain sums:

    image_result[i,j] = np.sum(np.sum(temp, axis= 0), axis=0) # with (1/(ks*ks)) removed
    ⇔
    image_result[i,j] = np.sum(temp, axis=(0,1))
    

    Even more vectorization

    To vectorize even the convolution itself (numpy has no 2D convolution operator; scipy has, tho; see the sketch just below) you need a quite convoluted (no pun intended) way. But it is possible.
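    For reference, the scipy route could look like this; a sketch assuming an RGBA image as in the question and the normalized 2D kernel from above, applied to the unpadded image (mode='same' preserves dimensions), with the alpha channel passed through untouched:

    from scipy.signal import convolve2d

    blurred = np.dstack([convolve2d(image[:, :, c], kernel, mode='same')
                         for c in range(3)] + [image[:, :, 3]])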

    If you build

    H,W,C = image_result.shape
    st1,st2,st3 = image.strides
    blocks = np.lib.stride_tricks.as_strided(image, shape=(H,W,ks,ks,C), strides=(st1,st2,st1,st2,st3))
    

    Then blocks is a 5D array: in effect, an H×W array of ks×ks×C subimages.

    Explanation: numpy arrays are just a bunch of contiguous data, plus some metadata telling how to interpret them, mainly a shape and strides. For example, the data '0 1 2 3 4 5 6 7 8 9 10 11', with shape (3,4) and strides (4, 1) (assuming all data are bytes; it doesn't matter, it is an example), means that you consider those 12 numbers as an array of dimension 2 and shape (3,4), whose element arr[i,j] is the 4*i+1*jth number. So

    0 1  2  3
    4 5  6  7
    8 9 10 11
    

    The same data, with shape (2,2,3), and strides (6,3,1) is a 3D array whose element arr[i,j,k] is the 6*i+3*j+kth element. So

    [[[0, 1, 2], [3, 4, 5]],
     [[6, 7, 8], [9, 10, 11]]]
    

    And the strides and shape can be anything. Nothing says that each number has to be used once and only once, and in order, as long as we can compute the address as a linear combination stride1*index1+stride2*index2+....

    For example, the same data with shape (4,5,2) and strides (1,2,0) is a 4×5×2 array whose element arr[i,j,k] is the (i+2j)th element (the 0 stride means k does not advance through the data at all). So

    [[[0,0], [2, 2], [4, 4], [6, 6], [8, 8]],
     [[1,1], [3, 3], [5, 5], [7, 7], [9, 9]],
     [[2,2], [4, 4], [6, 6], [8, 8], [10, 10]],
     [[3,3], [5, 5], [7, 7], [9, 9], [11, 11]]]
    

    as_strided creates a new array that shares the same data (it is a view; no data is copied, so as_strided is basically free, both in CPU time and in memory usage) but with a different shape and strides. The three toy examples above can be checked directly with it, as below.
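    A quick check of those examples; int8 data so that one element is one byte, matching the strides used above:

    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    data = np.arange(12, dtype=np.int8)    # the 12 one-byte numbers of the examples

    print(as_strided(data, shape=(3, 4), strides=(4, 1)))
    print(as_strided(data, shape=(2, 2, 3), strides=(6, 3, 1)))
    print(as_strided(data, shape=(4, 5, 2), strides=(1, 2, 0)))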

    So, my blocks is a 5D array such that blocks[y0, x0, dy, dx, c] is the (y0*st1+x0*st2+dy*st1+dx*st2+c*st3)th byte of image, which is the same pixel as image[y0+dy, x0+dx, c], since image's strides are st1, st2, st3.

    What is the advantage of that? Well, it cost nothing. And now, to compute the convolution result to be stored at pixel [y,x], you just have to evaluate (blocks[y,x]*kernel[:,:,None]).sum(axis=(0,1)) (the added axis broadcasts the 2D kernel over the channels, as in the Broadcasting section).

    But more importantly (since, so far, blocks[y,x] is just a fancy reimplementation of your image[y:y+ks, x:x+ks]), you can do it in one call, and let numpy handle the double for loop:

    kernel = kernel[None,None,:,:,None] # a 5D kernel with size-1 axes
    image_result = (blocks*kernel).sum(axis=(2,3))
    

    With a warning tho: if your image is, say, 1000×1000×3 (rgb), and σ=4, so ks=25, then blocks is a 1000×1000×25×25×3 array. That is almost 2 billion elements. Not as bad as it sounds: it is not really memory. Memory is still just the 1000×1000×3 elements; it is only the view that has 2 billion indexations, many of them referring to the same data.

    But when you multiply by kernel, tho, that creates a result which is a new 1000×1000×25×25×3 array. And this time it is not a view: there really are 2 billion numbers in memory (the 2 billion multiplication results your code also computes; you just never store them all at once, whereas this version does). And those numbers are floats, since kernel is float, so 8 bytes per number: some 15 gigabytes of memory. I don't know if you have that much memory available, but I don't.

    But that is not really a problem: we can compute that by parts, in a for loop

    # This can be done once; it costs nothing
    H,W,C = image_result.shape
    st1,st2,st3 = image.strides
    blocks = np.lib.stride_tricks.as_strided(image, shape=(H,W,ks,ks,C), strides=(st1,st2,st1,st2,st3))
    kernel = kernel[None,None,:,:,None] # a 5D kernel with size-1 axes
    
    for y in range(0, H, 20):
        image_result[y:y+20] = (blocks[y:y+20]*kernel).sum(axis=(2,3))
    

    I know that this is removing a for loop only to put one back. But, firstly, it is a single loop, not a double one. And secondly, it iterates far less: every 20 lines here, so 50 iterations in my 1000×1000 example, instead of 1000000 (the remaining 20000 iterations per chunk are also a loop, but inside numpy's fast inner code).

    You can of course tune that 20 parameter. If you have a lot of memory, it can be quite high (and if you have dozens of GB, you can even remove the whole loop). If, on the contrary, memory is tight, you can process things line by line as you were doing (replacing that 20 by 1), keeping vectorization only for the x iteration. A full sketch putting all of this together follows.
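    Putting it all together, a sketch of the corrected function under a few assumptions of mine: it always pads (so the padding flag is gone), the chunk parameter is my own name for the 20 above, all channels, alpha included, are blurred as in your original code, and the uint8 conversion is left to the caller as discussed earlier:

    import numpy as np
    from numpy.lib.stride_tricks import as_strided

    def gaussian_filter(image, sigma, chunk=20):
        ks = int(2*(3*sigma) + 1)                # the 3σ rule of thumb
        k = np.arange(-(ks-1)/2, ks/2)

        # Separable gaussian, normalized so the kernel sums to exactly 1
        kernel = np.exp(-k[:, None]**2/(2*sigma**2)) * np.exp(-k[None, :]**2/(2*sigma**2))
        kernel /= kernel.sum()
        kernel = kernel[None, None, :, :, None]  # 5D: broadcasts over pixels and channels

        H, W, C = image.shape
        image_result = np.zeros((H, W, C))

        pad = (ks - 1)//2                        # pad to preserve dimensions
        image = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))

        st1, st2, st3 = image.strides
        blocks = as_strided(image, shape=(H, W, ks, ks, C),
                            strides=(st1, st2, st1, st2, st3))

        for y in range(0, H, chunk):             # chunked to bound memory usage
            image_result[y:y+chunk] = (blocks[y:y+chunk]*kernel).sum(axis=(2, 3))

        return image_result                      # floats; .astype(np.uint8) to display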