python pdf annotations kindle

Redacted / highlighted PDF becomes too big with this script. Can it be improved?


A few years ago I asked a related question. I wanted to extract my Kindle annotations from the MyClippings.txt file and use them to annotate a PDF version of the original text. This is very useful for academic reading: an annotated PDF of the original is easier to skim and to cite. A few months ago I found a solution in the following script.

import fitz  # PyMuPDF

# the document to annotate
doc = fitz.open("text_to_highlight.pdf")

# the text to be marked
text_list = [
    "first piece of text",
    "second piece of text",
    "third piece of text",
]

for page in doc:
    for text in text_list:
        rl = page.search_for(text, quads=True)
        if rl:  # skip phrases that do not occur on this page
            page.add_highlight_annot(rl)

# save to a new PDF
doc.save("text_annotated.pdf")

Since then, however, I have run into a new problem. The output PDF for a 700-page book becomes extremely large (more than 500 MB). (The script had to be run a few times, because with all the annotations at once it would crash; this is not necessarily a problem, but it suggests inefficiency.) Is there an approach, presumably Python-based, that would avoid such an inefficient outcome?


Solution

  • In case anybody gets here and is interested in this functionality, let me share the workflow and the code (slightly changed and improved from the one above, but basically the same). It is useful when you have read a book as an ePub but want to save your notes in a PDF for easier skimming when doing research.

    Purpose

    To highlight a PDF using the MyClippings.txt file produced by the Kindle.

    Steps

    First, we need to extract from MyClippings.txt the passages we want to highlight in the PDF. This is fairly easy to do manually. We save the (rather long) lines, one passage per line, in original_long_lines.txt.
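
    If you would rather not do the extraction by hand, here is a minimal sketch. It assumes the usual clippings layout, which I have not checked against every Kindle model: entries separated by a line of equals signs, each consisting of a title line, a metadata line containing the word "Highlight", a blank line, and then the text. The function name extract_highlights and its parameters are my own, not part of any library.

```python
def extract_highlights(clippings_text, book_title):
    """Return the highlighted passages for one book, one string per passage."""
    passages = []
    # Assumption: entries are separated by a line of ten equals signs.
    for entry in clippings_text.split("=========="):
        lines = [ln.strip() for ln in entry.strip().splitlines()]
        # Assumption: title line, metadata line, blank line, then the text.
        if len(lines) < 3:
            continue
        title, meta, body = lines[0], lines[1], lines[2:]
        if book_title.lower() not in title.lower():
            continue  # entry belongs to another book
        if "highlight" not in meta.lower():
            continue  # skip notes and bookmarks
        text = " ".join(part for part in body if part)
        if text:
            passages.append(text)
    return passages
```

    To use it, read the clippings file with open(path, encoding="utf-8-sig"), pass the contents and a fragment of the book title to the function, and write the returned passages to original_long_lines.txt, one per line.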

    This is not enough: we need to cut those long lines into chunks of about five words, because otherwise the PDF search will not work properly. For that purpose we run the following code (check input_file and output_file and name them accordingly).

    def break_lines(input_file, output_file):
        with open(input_file, 'r') as file:
            lines = file.readlines()
    
        output_lines = []
        for line in lines:
            words = line.split()
            # Lines with fewer than three words are dropped entirely
            if len(words) >= 3:
                # Break line into new lines with a maximum of five words
                for i in range(0, len(words), 5):
                    output_line = ' '.join(words[i:i+5])
                    output_lines.append(output_line)
    
        with open(output_file, 'w') as file:
            file.write('\n'.join(output_lines))
    
        print(f"Output written to: {output_file}")
    
    
    # Example usage
    input_file = 'original_long_lines.txt'
    output_file = 'shorter_lines.txt'
    break_lines(input_file, output_file)
    
    

    Second, that is not enough either: the chunking can leave lines with only one or two words, and those would be matched and highlighted all over the PDF. We therefore merge each such short line into the line before it, using the following code:

    def join_lines(input_file, output_file):
        with open(input_file, 'r') as file:
            lines = file.readlines()
    
        output_lines = []
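        prev_line = ''
    
        for line in lines:
            words = line.split()
            if len(words) <= 2:
                # Short line: merge it into the line before it
                prev_line += ' ' + line.strip()
            else:
                # Flush the previous line, skipping the empty initial value
                if prev_line.strip():
                    output_lines.append(prev_line.strip())
                prev_line = line.strip()
    
        # Add the last line to the output
        if prev_line.strip():
            output_lines.append(prev_line.strip())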
    
        with open(output_file, 'w') as file:
            file.write('\n'.join(output_lines))
    
        print(f"Output written to: {output_file}")
    
    
    # Example usage
    input_file = 'shorter_lines.txt'
    output_file = 'shorter_lines_no_one_or_two_words.txt'
    join_lines(input_file, output_file)
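
    The two file-based steps above can also be done in memory. The following is only a sketch of that idea, not the code I actually ran: chunk_passage and its parameters size and min_words are names made up for illustration. Unlike join_lines, it never merges text across two different passages, because it works on one passage at a time.

```python
def chunk_passage(passage, size=5, min_words=3):
    """Split one passage into search phrases of about `size` words."""
    words = passage.split()
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    # Fold a short trailing chunk into the previous one, so that a
    # one- or two-word phrase is never searched for on its own.
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    # A passage shorter than min_words yields no phrases at all.
    return [" ".join(chunk) for chunk in chunks if len(chunk) >= min_words]


def build_phrases(passages):
    """Chunk every passage and return all phrases as one flat list."""
    phrases = []
    for passage in passages:
        phrases.extend(chunk_passage(passage))
    return phrases
```

    The list returned by build_phrases can be written to a text file, one phrase per line, and used in place of shorter_lines_no_one_or_two_words.txt.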
    
    

    And finally, we use the following code to highlight the PDF using our shorter_lines_no_one_or_two_words.txt text file.

    import os
    
    import fitz  # PyMuPDF
    from tqdm import tqdm
    
    def highlight_pdf(pdf_path, text_file):
        # Load the list of strings from the text file, skipping blank lines
        with open(text_file, 'r') as file:
            search_strings = [s for s in file.read().splitlines() if s.strip()]
    
        # Open the PDF file
        pdf = fitz.open(pdf_path)
    
        # Initialize the progress bar
        progress_bar = tqdm(total=len(pdf), unit='page')
    
        for page_num in range(len(pdf)):
            page = pdf[page_num]
            for search_string in search_strings:
                text_instances = page.search_for(search_string, quads=True)
                for inst in text_instances:
                    # Highlight the found text
                    page.add_highlight_annot(inst)
    
            # Update the progress bar after processing each page
            progress_bar.update(1)
    
        # Close the progress bar
        progress_bar.close()
    
        # Save the modified PDF next to the original, with a "highlighted_" prefix
        directory, filename = os.path.split(pdf_path)
        output_path = os.path.join(directory, 'highlighted_' + filename)
        pdf.save(output_path)
        pdf.close()
    
        print(f"Highlighted PDF saved as: {output_path}")
    
    
    # Example usage
    pdf_path = 'your_pdf.pdf'
    text_file = 'shorter_lines_no_one_or_two_words.txt'
    highlight_pdf(pdf_path, text_file)
    

    In my experience, this sometimes greatly increases the size of the final file, and sometimes it does not. The problem can easily be solved with cpdf, as mentioned by John Whitington above: cpdf -squeeze huge_pdf.pdf -o small_pdf.pdf. And now you have your Kindle highlights in your PDF.
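
    If you prefer to stay within Python, PyMuPDF's own save options may also help. This is a sketch I have not tested on the 700-page book from the question, so treat the size benefit as an assumption. It makes three changes: it drops blank and repeated phrases, it creates one annotation per phrase per page instead of one per hit, and it saves with garbage=4 and deflate=True. The function names unique_phrases and highlight_compact are made up for illustration.

```python
def unique_phrases(phrases):
    """Drop blank and repeated phrases, keeping first-seen order."""
    cleaned = (p.strip() for p in phrases)
    return list(dict.fromkeys(p for p in cleaned if p))


def highlight_compact(pdf_path, phrases, output_path):
    """Highlight the phrases, then save with PyMuPDF's clean-up options."""
    # PyMuPDF; imported here so unique_phrases can be used without it.
    import fitz

    search_strings = unique_phrases(phrases)
    doc = fitz.open(pdf_path)
    for page in doc:
        for phrase in search_strings:
            quads = page.search_for(phrase, quads=True)
            if quads:
                # One annotation covering every hit of this phrase on the
                # page, rather than one annotation per hit.
                page.add_highlight_annot(quads)
    # garbage=4 removes unused objects and merges duplicates;
    # deflate=True compresses streams that are stored uncompressed.
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()
```

    Call it as highlight_compact('your_pdf.pdf', search_strings, 'highlighted_your_pdf.pdf'), where search_strings is the list read from the text file. If the result is still too large, cpdf -squeeze can be run on the output as before.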