pythonpdfiodocxbytestream

Convert DOCX Bytestream to PDF Bytestream Python


I currently have a program that generates a .docx document using the python-docx library.

Upon completing the building of the .docx file I save it into a Bytestream as so

file_stream = io.BytesIO()
document.save(file_stream)
file_stream.seek(0)

Now, I need to convert this word document into a PDF. I have looked at a few different libraries for conversion such as docx2pdf or even doing it manually using comtypes as so

import sys
import os
import comtypes.client

wdFormatPDF = 17

in_file = "Input_file_path.docx"
out_file = "output_file_path.pdf"

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

The problem is, I need to do this conversion in memory and cannot physically save the DOCX or the PDF to the machine. Every converter I've seen requires a filepath to the physical document on the machine and I do not have that.

Is there a way I can convert the DOCX filestream into a PDF stream just in memory?

Thanks


Solution

  • This method is a little convoluted, but it works entirely in memory, and you get the option to add custom CSS to style the final document.
    Convert the DOCX bytestream to HTML using mammoth, and the resulting HTML to PDF using pdfkit.

    Here's an example

    # create a dummy docx file
    from docx import Document
    document = Document()
    document.add_paragraph('Lorem ipsum dolor sit amet.')
    
    # create a bytestream
    import io
    file_stream = io.BytesIO()
    document.save(file_stream)
    file_stream.seek(0)
    
    # convert the docx to html
    import mammoth
    result = mammoth.convert_to_html(file_stream)
    
    # >>> result.value
    # >>> '<p>Lorem ipsum dolor sit amet.</p>'
    
    # convert html to pdf
    import pdfkit
    pdf = pdfkit.from_string(result.value)
    

    If you want to output the stream to a file, just do

    with open('test.pdf','wb') as file:
        file.write(pdf)