Tags: python, garbage-collection, pdfplumber

pdfplumber memory hogging (crashes with large PDF files)


Using pdfplumber to extract text from large PDF files keeps growing memory usage until the process crashes.

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()  # process the page here

Solution

  • Solution found at: https://github.com/jsvine/pdfplumber/issues/193

    New (uses page.flush_cache()):

    with pdfplumber.open("data/my.pdf") as pdf:
        for page in pdf.pages:
            run_my_code()
            page.flush_cache()
    

    Old:

    with pdfplumber.open("data/my.pdf") as pdf:
        for page in pdf.pages:
            run_my_code()
            del page._objects
            del page._layout
    

    These two attributes, page._objects and page._layout, seem to be responsible for most of the memory retained after each loop iteration; deleting them keeps the cached page data from piling up.

    If this does not work, additionally try forcing the garbage collector to run:

    import gc

    with pdfplumber.open("data/my.pdf") as pdf:
        for page in pdf.pages:
            run_my_code()
            del page._objects   # drop the cached object lists
            del page._layout    # drop the cached layout
            gc.collect()        # force a collection pass after each page
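Why deleting the reference helps can be shown with plain Python objects. The sketch below uses a hypothetical PageLike class standing in for a pdfplumber page with cached data (it is not part of pdfplumber); a weak reference lets us observe the moment the memory becomes reclaimable:

```python
import gc
import weakref

class PageLike:
    """Hypothetical stand-in for a page holding large cached attributes."""
    def __init__(self):
        self._objects = list(range(100_000))  # simulates cached layout objects

page = PageLike()
probe = weakref.ref(page)  # observes the object without keeping it alive

del page       # drop the last strong reference, like `del page._objects` drops cached data
gc.collect()   # redundant here (CPython frees on refcount zero), shown for parity

print(probe() is None)  # True: the cached data has been reclaimed
```

The same principle applies in the loop above: as long as something still references the cached lists, they cannot be freed, no matter how often the collector runs.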
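Forcing gc.collect() matters when objects participate in reference cycles, which CPython's reference counting alone cannot free; the cycle detector has to find them. A minimal stdlib illustration (Node is a made-up class, unrelated to pdfplumber):

```python
import gc

class Node:
    pass

a, b = Node(), Node()
a.partner, b.partner = b, a  # reference cycle: each object keeps the other alive

del a, b                     # no names left, but the cycle still holds itself
collected = gc.collect()     # the collector finds and frees the unreachable pair

print(collected >= 2)        # True: at least the two Node objects were reclaimed
```

Calling gc.collect() inside the page loop trades a little CPU time per iteration for a flat memory profile, which is usually the right trade for very large PDFs.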