Efficiently Removing a Single Page from a Large Multi-page TIFF with JPEG Compression in Python

I am working with a large multi-page TIFF file that is JPEG-compressed, and I need to remove a single page from it. I am using the tifffile Python package to process the TIFF, and I already know which page I want to remove based on metadata tags associated with that page. My current approach is to read all pages, modify the target page (either by skipping or replacing it), and write the rest back to a new TIFF file.

Here’s what I’ve tried so far:

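import tifffile

def should_remove(page):
    # Placeholder: inspect page.tags and return True for the page to drop
    return False

# bigtiff=True lets the output grow past the 4GB limit of classic TIFF
with tifffile.TiffFile('file') as tif, \
        tifffile.TiffWriter('output.tif', bigtiff=True) as writer:
    for i, page in enumerate(tif.pages):
        if should_remove(page):
            continue  # Skip the page to delete (or write a dummy page instead)

        # Memory-mapped access to the page's decoded data
        image_data = page.asarray(out='memmap')

        # Write the page to the output file (this re-encodes the image)
        writer.write(
            image_data,
            compression='jpeg',
            photometric=page.photometric,
            description=page.description,
            metadata=None,
        )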

However, this approach has several issues:

Memory Usage: Processing a large file consumes almost all available memory (up to 28GB of my 32GB of RAM), which makes this approach infeasible for large files.
Compression Issues: Re-encoding with different compression methods (LZW, ZSTD, JPEG) produces files of vastly different sizes, and some are much larger than the original.
Performance: Writing in strips or chunks is very slow, so deleting a single page takes far too long.
Output File Size: Switching to a different compression method makes the output far too large (a 3GB JPEG-compressed input becomes a 50GB+ output with LZW).

Is there any way in Python to efficiently remove a single page from a large multi-page TIFF file without consuming too much memory or taking forever? I’ve seen some .NET packages that can delete a page in-place—does Python have a similar solution?

Solution

I've created a Python package to handle it. While it can be made more extensible, it efficiently solves the problem without loading all the image data into memory.

Core Idea:

The package works by:

Reconstructing the IFD (Image File Directory) chain: It removes the IFD of concern while keeping references to the original image data.
Adjusting Metadata: The metadata pointing to the "label" information is updated implicitly during the reconstruction process.
Memory Efficiency: By referencing the actual image data, the package avoids the need to load large image files into memory.
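
To illustrate the idea, here is a minimal standard-library sketch of unlinking one IFD from the chain. It is not the package's actual code, and it rests on several simplifying assumptions: the file is a classic TIFF (not BigTIFF), the page to drop is a top-level IFD (SubIFDs are ignored), and the orphaned image data is left in the file rather than reclaimed. The function names `read_ifd_chain`, `unlink_ifd`, and `remove_page` are hypothetical.

```python
import shutil
import struct


def read_ifd_chain(f):
    """Return (byteorder, [(pointer_pos, ifd_offset), ...]) for a classic TIFF.

    pointer_pos is the file position of the 4-byte field that points at the IFD.
    """
    f.seek(0)
    header = f.read(8)
    byteorder = {b"II": "<", b"MM": ">"}[header[:2]]
    magic, first_offset = struct.unpack(byteorder + "HI", header[2:8])
    if magic != 42:
        raise ValueError("this sketch only handles classic TIFF (magic 42)")

    chain = []
    pointer_pos, offset = 4, first_offset
    while offset:
        chain.append((pointer_pos, offset))
        f.seek(offset)
        (entry_count,) = struct.unpack(byteorder + "H", f.read(2))
        # The next-IFD pointer follows the 12-byte tag entries
        pointer_pos = offset + 2 + 12 * entry_count
        f.seek(pointer_pos)
        (offset,) = struct.unpack(byteorder + "I", f.read(4))
    return byteorder, chain


def unlink_ifd(path, index):
    """Unlink the IFD at position `index` by rewriting one 4-byte pointer."""
    with open(path, "r+b") as f:
        byteorder, chain = read_ifd_chain(f)
        pointer_pos, offset = chain[index]
        # Read the removed IFD's own next-IFD pointer ...
        f.seek(offset)
        (entry_count,) = struct.unpack(byteorder + "H", f.read(2))
        f.seek(offset + 2 + 12 * entry_count)
        next_pointer = f.read(4)
        # ... and store it where the chain used to point at the removed IFD
        f.seek(pointer_pos)
        f.write(next_pointer)


def remove_page(src, dst, index):
    """Copy src to dst on disk, then unlink one page from the copy."""
    shutil.copyfile(src, dst)
    unlink_ifd(dst, index)
```

Only a few bytes are read and written, so memory use does not depend on the image size, and the compressed data is never decoded or re-encoded. The trade-off in this sketch is that the output file keeps the removed page's bytes, so it does not shrink.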

Installation: You can install the package directly from PyPI:

pip install tiff-wsi-label-removal

Usage: Once installed, you can use the remove-label command-line tool to remove labels from a TIFF file:

remove-label <input_tiff_file> <output_tiff_file>

Current Limitations and Future Work:

The package is functional, but there’s room for improvement, including making it more extensible.

Suggestions from the comments here and additional planned features are being tracked in the description on the package's PyPI page.

Any feedback is welcome!