Tags: python, pdf, python-pdfkit

How to preserve the PDF layout after translating content from English to French using Python


I am working on a simple application that translates my PDF files from English to French and saves the result as PDF. I have a proof of concept that iterates over the pages of a given file and translates all the text into French. Now I am stuck on saving the translated French text to a PDF with the same structure as the original English version.

import PyPDF2
from googletrans import Translator
translator = Translator()

read_pdf = PyPDF2.PdfFileReader(open('any_english.pdf', 'rb'))
write_pdf = PyPDF2.PdfFileWriter()
number_of_pages = read_pdf.getNumPages()

for i in range(number_of_pages):
    page = read_pdf.getPage(i)
    page_content = page.extractText()
    print(translator.translate(page_content, dest='fr').text)

    # TODO: save the translated French text to a PDF that keeps the structure of the original

Note: All content in the PDFs is text, not images.


Solution

  • There is no easy way to open, edit, and rewrite PDFs in Python. However, depending on the complexity and structure of the PDF, you might have success converting the PDF to HTML, translating the HTML, and then generating a new PDF from it.

    For converting PDF to HTML, there is pdf2html, which has a basic Python wrapper.

    Once the translation is done, you can reverse the process, with varying degrees of success, using e.g. weasyprint, html2pdf (Mac only), or wkhtmltopdf (requires Qt).
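The middle step of that pipeline (translating the HTML without disturbing its markup) can be sketched with the standard library alone. This is only an illustration under stated assumptions: the HTML is assumed to have already been produced by a PDF-to-HTML converter, `TextTranslator`, `translate_html`, and `html_file_to_french_pdf` are hypothetical names, and the last function assumes that googletrans exposes the synchronous `translate(...).text` call used in the question and that weasyprint is installed. Only text nodes are passed to the translator; tags, attributes, and the contents of `<style>` and `<script>` are copied through unchanged, since the positioning information lives there.

```python
from html import escape
from html.parser import HTMLParser


class TextTranslator(HTMLParser):
    """Rebuild an HTML document, sending only visible text through `translate`."""

    SKIP = {"script", "style"}  # contents of these tags are copied verbatim

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        # Re-emit the tag exactly as written, including all attributes.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            self.out.append(data)  # CSS, JS, or whitespace: leave untouched
        else:
            # Entities were decoded by the parser, so escape again on output.
            self.out.append(escape(self.translate(data), quote=False))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def translate_html(html_text, translate):
    """Return `html_text` with every text node replaced by translate(text)."""
    parser = TextTranslator(translate)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def html_file_to_french_pdf(html_path, pdf_path):
    """Hypothetical glue code: translate an HTML file and render it to PDF."""
    # Imported lazily so the functions above work without these packages.
    from googletrans import Translator  # assumed: synchronous API as in the question
    from weasyprint import HTML

    translator = Translator()
    with open(html_path, encoding="utf-8") as f:
        source = f.read()
    french = translate_html(
        source, lambda text: translator.translate(text, dest="fr").text
    )
    HTML(string=french).write_pdf(pdf_path)
```

Calling `translate_html(page_html, some_function)` with any function that maps a string to a string is enough to try the approach offline. French text is often longer than the English original, so absolutely positioned blocks may overflow after translation, which is one reason the round trip only works with varying degrees of success.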