Tags: python, package, python-tesseract, poppler

Using anonympy to automatically anonymise PDFs


I am trying to use anonympy (https://github.com/ArtLabss/open-data-anonymizer) to anonymise PDFs. Unfortunately, the package doesn't seem to be too popular or well-documented, so there isn't a lot for me to go on except the sample code in the GitHub page documentation.

I was able to install the package with the command given in the answer to a question I previously asked (Problem when running terminal command: "pip install anonympy"). The code in the documentation for my use case is shown as follows:

from anonympy.pdf import pdfAnonymizer

# need to specify paths, since I don't have them in system variables
>>> anonym = pdfAnonymizer(path_to_pdf = "Downloads\\test.pdf",
                       pytesseract_path = r"C:\Program Files\Tesseract-OCR\tesseract.exe",
                       poppler_path = r"C:\Users\shakhansho\Downloads\Release-22.01.0-0\poppler-22.01.0\Library\bin")

# Calling the generic function
>>> anonym.anonymize(output_path = 'output.pdf',
                     remove_metadata = True,
                     fill = 'black',
                     outline = 'black')

I installed tesseract and poppler using Homebrew, and replicated the code as follows:

from anonympy.pdf import pdfAnonymizer



anonym = pdfAnonymizer(path_to_pdf = "test.pdf", 
                       pytesseract_path = "/opt/homebrew/Cellar/tesseract/5.3.4_1/bin/tesseract",
                       poppler_path = "/opt/homebrew/Cellar/poppler/24.04.0")

# Calling the generic function
anonym.anonymize(output_path = 'output.pdf',
                     remove_metadata = True,
                     fill = 'black',
                     outline = 'black')

After getting some errors about missing some packages, activating a virtual environment, installing those in there, and then trying to run the code, the error message I get now has me stumped:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/homebrew/Cellar/poppler/24.04.0/pdfinfo'
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

When I check the poppler folder, it does indeed have a binary called "pdfinfo".

Can anyone help me get further here? I understand that the sample code was clearly written on Windows, and as you can see from where Homebrew installs things on my machine, I am on an M1 MacBook Pro.

I would also greatly appreciate anyone's general suggestions for using Python to automatically redact PDF files, especially ones like CVs which have a lot of personal information. I'm not trying to do this commercially, rather for a student initiative that I am a part of. We've tried an approach using regex to match applicants' form inputs to aspects of their CVs but this has proven to be buggy.


Solution

  • Thanks for the question. I tried to replicate your journey and also found it very frustrating. It's not clear how to set all the paths correctly.

    Step 1. First, find out where Homebrew installed poppler. Use the following command:

    brew info poppler
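
If you would rather look the folder up from Python, here is a minimal sketch that is not part of the original answer. It assumes the Homebrew bin directory is on your PATH, and `find_bin_dir` is a hypothetical helper name. It uses the standard library's `shutil.which` to locate an executable and then follows Homebrew's symlink back to the real folder that contains it.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_bin_dir(executable: str, search_path: Optional[str] = None) -> Optional[Path]:
    """Return the real directory containing `executable`, or None if it is not found.

    `search_path` is an optional PATH-style string; when omitted, the
    current PATH environment variable is searched.
    """
    found = shutil.which(executable, path=search_path)
    if found is None:
        return None
    # Homebrew exposes binaries through symlinks; resolve() follows them
    # so the returned folder is the one that physically holds the file.
    return Path(found).resolve().parent
```

Calling `find_bin_dir("pdfinfo")` should give a folder suitable for `poppler_path`, and `shutil.which("tesseract")` gives a candidate for `pytesseract_path`. If either returns `None`, the tool is not on your PATH and you will need the `brew info` output instead.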
    


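    Step 2. Now pass the paths to the library. For tesseract, provide the full path to the 'tesseract' executable itself; for poppler, provide the path to its bin folder. The error in the question occurs because poppler_path pointed at the version folder (.../poppler/24.04.0) rather than at .../poppler/24.04.0/bin, which is where pdfinfo lives.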

    Below is my code showing how to do this:

    from pathlib import Path

    from anonympy.pdf import pdfAnonymizer

    # Adjust the version folders to match your own `brew info` output.
    tesseract_path = Path("/opt/homebrew/Cellar/tesseract/5.3.4/bin/tesseract")  # the executable itself
    poppler_path = Path("/opt/homebrew/Cellar/poppler/24.02.0/bin")  # the bin folder, not the version folder

    # Resolve the input PDF to an absolute path
    pdf_path = Path("./input.pdf").resolve()

    anonym = pdfAnonymizer(path_to_pdf=str(pdf_path),
                           pytesseract_path=tesseract_path,
                           poppler_path=poppler_path)

    # Calling the generic function
    anonym.anonymize(output_path='output.pdf',
                     remove_metadata=True,
                     fill='black',
                     outline='black')
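
To catch a wrong path before anonympy raises the less obvious PDFInfoNotInstalledError, a small pre-flight check can help. This is only a sketch using the standard library; `check_ocr_paths` is a hypothetical helper, not part of anonympy. It relies on what the traceback in the question shows: the library looks for a file named pdfinfo directly inside the folder given as poppler_path.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def check_ocr_paths(tesseract_cmd: PathLike, poppler_bin: PathLike) -> List[str]:
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    tesseract_cmd = Path(tesseract_cmd)
    poppler_bin = Path(poppler_bin)

    # The tesseract path must point at the executable file itself.
    if not tesseract_cmd.is_file():
        problems.append(f"tesseract executable not found: {tesseract_cmd}")

    # The poppler path must be the folder that directly contains pdfinfo.
    if not (poppler_bin / "pdfinfo").is_file():
        problems.append(f"pdfinfo not found directly inside: {poppler_bin}")

    return problems
```

With the paths from the question, the second check would presumably have reported that pdfinfo is not directly inside /opt/homebrew/Cellar/poppler/24.04.0, which points to the missing /bin suffix.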
    
    

    Hope this helps.