A few years ago I asked this question. I wanted to extract my Kindle annotations from the MyClippings.txt file and use them to annotate a PDF version of the original text. This is very useful for academic reading (an annotated PDF of the original is easier to skim and cite). A few months ago I found a solution in the following script.
import fitz  # PyMuPDF

# the document to annotate
doc = fitz.open("text_to_highlight.pdf")

# the text to be marked
text_list = [
    "first piece of text",
    "second piece of text",
    "third piece of text",
]

for page in doc:
    for text in text_list:
        rl = page.search_for(text, quads=True)
        if rl:  # nothing to highlight if the text is not on this page
            page.add_highlight_annot(rl)

# save to a new PDF
doc.save("text_annotated.pdf")
Since then, however, I have run into a new problem. On a 700-page book, the output PDF becomes extremely large (more than 500 MB). (The script also had to be run several times, because with all the annotations at once it would crash; this is not necessarily a problem, but it suggests inefficiency.) Is there an approach, Python-based I would guess, that could prevent such an inefficient outcome?
So, in case anybody gets here and is interested in this functionality, let me share the workflow and the code (slightly changed and improved from the script above, but basically the same). It is useful when you have read a book as an ePub but want to save your notes in a PDF for better skimming when doing research.
To highlight a PDF using the MyClippings.txt file produced by the Kindle:
First, we need to extract from MyClippings.txt the portions of text that we want to highlight in the PDF. This is a fairly easy procedure, which I did manually. We can save the (rather long) lines in original_long_lines.txt.
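For anyone who would rather not do this step by hand, here is a minimal sketch of a parser. It is not what I used; it assumes the usual clippings layout (a title line, a metadata line such as "- Your Highlight on page ...", a blank line, the highlighted text, and a line of ten equals signs as separator) and an English-language device. The function name parse_clippings is made up for this example.

```python
from collections import defaultdict

SEPARATOR = "=========="  # assumed delimiter between clippings entries


def parse_clippings(raw_text):
    """Group highlight texts by book title.

    Assumes each entry is: title line, metadata line, blank line, text.
    """
    highlights = defaultdict(list)
    for entry in raw_text.split(SEPARATOR):
        # Keep only the non-empty lines of this entry
        lines = [ln.strip() for ln in entry.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 3:
            continue  # bookmarks and empty entries carry no text
        title, meta, text = lines[0], lines[1], " ".join(lines[2:])
        if "Highlight" not in meta:
            continue  # skip notes (assumes English metadata lines)
        # Some files carry a byte-order mark in front of the title
        highlights[title.lstrip("\ufeff")].append(text)
    return dict(highlights)
```

Reading MyClippings.txt into a string, calling parse_clippings on it, and writing the list for one title to original_long_lines.txt (one highlight per line) would then replace the manual copy and paste.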
This is not enough: we want to cut those long lines into chunks of approximately five words (otherwise the PDF search function will not work properly). For that purpose we run the following code (check input_file and output_file and name them accordingly).
def break_lines(input_file, output_file):
    with open(input_file, 'r') as file:
        lines = file.readlines()

    output_lines = []
    for line in lines:
        words = line.split()
        if len(words) >= 3:
            # Break line into new lines with a maximum of five words
            for i in range(0, len(words), 5):
                output_line = ' '.join(words[i:i+5])
                output_lines.append(output_line)

    with open(output_file, 'w') as file:
        file.write('\n'.join(output_lines))
    print(f"Output written to: {output_file}")

# Example usage
input_file = 'original_long_lines.txt'
output_file = 'shorter_lines.txt'
break_lines(input_file, output_file)
Second, that is not enough either: we want to get rid of lines that contain only one or two words by merging them into the preceding line (otherwise those one or two words would be highlighted everywhere they occur in the PDF). For that purpose, we use the following code:
def join_lines(input_file, output_file):
    with open(input_file, 'r') as file:
        lines = file.readlines()

    output_lines = []
    prev_line = ''
    for line in lines:
        words = line.split()
        if len(words) <= 2:
            # Merge a one- or two-word line into the preceding line
            prev_line += ' ' + line.strip()
        else:
            if prev_line.strip():  # avoid writing an empty first line
                output_lines.append(prev_line.strip())
            prev_line = line.strip()

    # Add the last line to the output
    if prev_line.strip():
        output_lines.append(prev_line.strip())

    with open(output_file, 'w') as file:
        file.write('\n'.join(output_lines))
    print(f"Output written to: {output_file}")

# Example usage
input_file = 'shorter_lines.txt'
output_file = 'shorter_lines_no_one_or_two_words.txt'
join_lines(input_file, output_file)
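As an aside, the two file-based steps above could probably be combined into a single in-memory pass. The following is only a sketch of that idea, not the code I actually ran; the function name chunk_highlight and its parameters size and min_words are invented here. It handles one highlight at a time, so a short tail is only ever merged into a chunk from the same highlight.

```python
def chunk_highlight(text, size=5, min_words=3):
    """Split text into chunks of `size` words.

    A final chunk shorter than `min_words` is folded into the chunk
    before it, so no one- or two-word search strings are produced.
    Highlights with fewer than `min_words` words are dropped entirely.
    """
    words = text.split()
    if len(words) < min_words:
        return []
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]
```

Applying this to every line of original_long_lines.txt and writing the results out one per line should give a file that can be used in place of shorter_lines_no_one_or_two_words.txt in the next step.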
And finally, we use the following code to highlight the PDF using our shorter_lines_no_one_or_two_words.txt text file.
import os

import fitz  # PyMuPDF
from tqdm import tqdm

def highlight_pdf(pdf_path, text_file):
    # Load the list of strings from the text file, skipping blank lines
    with open(text_file, 'r') as file:
        search_strings = [s for s in file.read().splitlines() if s.strip()]

    # Open the PDF file
    pdf = fitz.open(pdf_path)

    # Iterate over the pages, showing a progress bar
    for page in tqdm(pdf, total=len(pdf), unit='page'):
        for search_string in search_strings:
            text_instances = page.search_for(search_string, quads=True)
            for inst in text_instances:
                # Highlight the found text
                page.add_highlight_annot(inst)

    # Save the modified PDF next to the original
    folder, name = os.path.split(pdf_path)
    output_path = os.path.join(folder, 'highlighted_' + name)
    pdf.save(output_path)
    pdf.close()
    print(f"Highlighted PDF saved as: {output_path}")

# Example usage
pdf_path = 'your_pdf.pdf'
text_file = 'shorter_lines_no_one_or_two_words.txt'
highlight_pdf(pdf_path, text_file)
In my experience, this sometimes increases the size of the final file dramatically, and sometimes it does not. This problem can easily be solved using cpdf, as mentioned by John Whitington above, as in cpdf -squeeze huge_pdf.pdf -o small_pdf.pdf. And now you have your Kindle highlights in your PDF.
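If installing cpdf is not an option, PyMuPDF's own save options might also help. The sketch below wraps the save call with garbage=4 and deflate=True, both of which are documented parameters of Document.save. I have not measured how the result compares with cpdf -squeeze on the 700-page book, so treat it as something to try rather than a confirmed fix. The helper name save_compact is made up.

```python
def save_compact(doc, output_path):
    """Save an open PyMuPDF document, asking for a smaller file.

    `doc` is expected to be the object returned by fitz.open().
    """
    # garbage=4 asks PyMuPDF to drop unused objects and merge duplicates;
    # deflate=True asks it to compress uncompressed streams.
    doc.save(output_path, garbage=4, deflate=True)
```

In highlight_pdf this would stand in for the plain pdf.save(output_path) call, as in save_compact(pdf, output_path).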