python, opencv, image-processing, ocr

How to crop image in OpenCV to keep only the white page with text, excluding darker borders?


I have scanned pages where the page is surrounded by gray/dark borders. I want to crop the image so that only the white page with the text remains — without any parts of the darker borders.

Most of the solutions I found use contour detection, which gives me the outer bounding box around the page. But this still includes the dark borders.

Example (illustration):

Original scan: white page + gray border

Current result: bounding box includes both page and border


Desired result: crop only the inner white page (with text), no gray border


What I tried so far:

import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Threshold
_, th = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Find contours
contours, _ = cv2.findContours(th, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Bounding box
x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
cropped = img[y:y+h, x:x+w]

This works for finding the outer box, but it still contains the gray border.

Question:

How can I detect and crop only the inner white page with text, excluding the surrounding dark/gray border, using OpenCV?


Solution

  • If you really just want to crop the image as-is, then below is my take on the task.

    If you want to correct for the perspective too, without losing any of the margins of the document, then you'll need different approaches.


    Here is an approach that iteratively crops 1-pixel lines off the side of a bounding box. In every iteration, it removes the side with the least amount of white.

    The input, which I manually cut out of your picture of a plot:

    input

    I binarize the image with Otsu to obtain a mask of "page" and "background". That makes everything easier.

    There's some stuff beyond the top left corner of the page (bottom right corner in the picture) that I need to discard. Towards that goal, I'll label the connected components, then pick the largest connected component.

    binarized largest CC

    I need to count the number of black and white pixels along every edge of the bounding box. That means summing over a rectangular area (one pixel wide here). To make this calculation fast, I'm using an Integral Image. It's not much to look at, basically a gradient, but the values are what matter: they let me look up the sum of pixels of any rectangle in the image. I'll use that to calculate the number of "out" (black) pixels along the sides of the bounding box, which will shrink and move.
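
    To make that lookup concrete, here is a tiny sanity check (my illustration, not part of the original answer): one combination of four integral-image entries equals a direct sum over the rectangle.

    import numpy as np
    import cv2 as cv

    a = (np.random.rand(6, 8) > 0.5).astype(np.uint8)   # toy binary mask
    ii = cv.integral(a)                                  # shape (7, 9); ii[y, x] == a[:y, :x].sum()

    x, y, w, h = 2, 1, 4, 3
    rect_sum = ii[y+h, x+w] - ii[y, x+w] - ii[y+h, x] + ii[y, x]
    assert rect_sum == a[y:y+h, x:x+w].sum()             # O(1) lookup matches the direct sum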

    Calculating the Integral Image has its own cost. It has to read all pixels. Shrinking the bounding box most probably will touch far less than half of all pixels, maybe just 10% of them... but it'll touch most pixels repeatedly unless there's complex logic to avoid that. It's a trade-off. Using the integral image is conceptually simpler and it's definitely cheaper to do once in an optimal memory access pattern using machine code (OpenCV is written in C++) than to sum across many many slices using numpy.

    Preparatory code:

    import cv2 as cv
    import numpy as np

    im = cv.imread("thumbnail-from-the-plot.png", cv.IMREAD_GRAYSCALE)
    (height, width) = im.shape[:2]

    (_, mask) = cv.threshold(im, 220, 255, cv.THRESH_BINARY | cv.THRESH_OTSU)
    # imshow(mask)  # optional: display the binarized mask with your viewer of choice

    (num_labels, labels, stats, _) = cv.connectedComponentsWithStats(mask)
    # label 0 is background
    # sort non-background components by area (CC_STAT_AREA), keep the largest
    sorted_idx = np.argsort(stats[1:, cv.CC_STAT_AREA]) + 1
    idx_largest = sorted_idx[-1]
    stats_largest = stats[idx_largest]
    mask_largest = (labels == idx_largest)

    # integral image of the page mask: lets us count "page" pixels in any rectangle in O(1)
    intgr = cv.integral(mask_largest.astype(np.uint8))

    def integral_sum(x, y, w, h):
        return int(intgr[y+h, x+w] - intgr[y, x+w] - intgr[y+h, x] + intgr[y, x])
    

    Here comes the main part, evaluating the sides of the box, picking the best one to remove.

    I chose the side with the most black pixels for removal.

    The iteration stops when all sides are all-white, i.e. when no side has any black pixels left.

    This can run into trouble if any side is cropped off so much that it's running into the actual text on the page. Such text would offer black pixels. This can be fixed by applying a closing morphology operation, or by another round of connected components/contour finding on the holes in the page and then erasing those holes.
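
    A minimal sketch of that fix (my addition, not part of the original code): close the page mask before building the integral image, so the letters no longer count as background. The kernel size here is a guess and depends on the text size.

    kernel = cv.getStructuringElement(cv.MORPH_RECT, (25, 25))   # roughly larger than a character
    page_closed = cv.morphologyEx(mask_largest.astype(np.uint8), cv.MORPH_CLOSE, kernel)
    intgr = cv.integral(page_closed)   # use this integral image in the loop below

    The shrinking loop itself: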

    bbox = stats_largest[[cv.CC_STAT_LEFT, cv.CC_STAT_TOP, cv.CC_STAT_WIDTH, cv.CC_STAT_HEIGHT]].tolist()
    # could also start with the whole image but this gives it a head start
    
    side_names = "top bottom left right".split()
    
    while True:
        (x, y, w, h) = bbox
    
        if w <= 2 or h <= 2: # pointless box
            break
    
        counts = np.array([w, w, h, h])  # total pixels along top, bottom, left, right
    
        in_top    = integral_sum(x,     y,     w, 1)
        in_bottom = integral_sum(x,     y+h-1, w, 1)
        in_left   = integral_sum(x,     y,     1, h)
        in_right  = integral_sum(x+w-1, y,     1, h)
        ins = np.array([in_top, in_bottom, in_left, in_right])  # page (white) pixels per side
        outs = counts - ins  # background (black) pixels per side
    
        if not (outs > 0).any(): # nothing to crop anymore (done)
            break
    
        side_index = np.argmax(outs)  # remove the side with the most background pixels
    
        match side_index:
            case 0: bbox = (x,   y+1, w,   h-1)
            case 1: bbox = (x,   y,   w,   h-1)
            case 2: bbox = (x+1, y,   w-1, h)
            case 3: bbox = (x,   y,   w-1, h)
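
    Once the loop stops, the surviving box can be applied to the image to get the crop (my addition; variable names as above):

    (x, y, w, h) = bbox
    cropped = im[y:y+h, x:x+w]
    cv.imwrite("cropped.png", cropped)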
    

    A sequence of how the box shrinks:

    animation

    This took 65 cheap iterations.
