python, python-3.x, ubuntu, pdf-generation, soffice

Convert .doc/.docx to .pdf from URL, on-the-fly, with Python, on Linux


I need to capture .doc or .docx files from external sites, convert them to PDF, and return the content. I then add a Content-Type header, publish through my CMS, cache via CDN, and display in HTML using the Adobe PDF Embed API. I'm using Python 3.7.

As a test, this works:

import os
import subprocess
from time import sleep

def generate_pdf():
    # soffice writes the PDF into the current working directory,
    # named after the source file
    subprocess.call(['soffice', '--convert-to', 'pdf',
                     'https://arbitrary.othersite.com/anyfilename.docx'])
    sleep(1)  # wait for the converted file to appear
    with open('anyfilename.pdf', 'rb') as myfile:
        content = myfile.read()
    os.remove('anyfilename.pdf')
    return content

This would be nice:

def generate_pdf(url):
    # desired: get the PDF bytes straight back, no temp files
    result = subprocess.call(['soffice', '--convert-to', 'pdf', url])
    content = result  # (call() actually returns an exit code, not bytes)
    return content

The URLs may include query parameters or characters that are illegal in file names, which makes it hard to predict the resulting file name. In any case, it would be preferable not to have to sleep, save, read, and delete the converted file.
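For what it's worth, soffice appears to name the output after the last path segment of the source (as in the working test above), so a guess can be derived from the URL. A sketch (the helper name is mine, and odd query strings or percent-encoded characters could still break it):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def guess_pdf_name(url):
    # hypothetical helper: take the URL's last path segment
    # and swap its extension for .pdf
    stem = PurePosixPath(urlparse(url).path).stem
    return stem + '.pdf'

guess_pdf_name('https://arbitrary.othersite.com/anyfilename.docx')
# 'anyfilename.pdf'
```

Note that query parameters are dropped by `urlparse(...).path`, which is exactly why this guess may not match what soffice actually writes.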

Is this possible?


Solution

  • I don't think soffice supports outputting to stdout, so you don't have many choices. If you output to a temporary directory, though, you can use os.listdir to get the file name:

    import os
    import subprocess
    import tempfile

    url = "https://www.usariem.army.mil/assets/docs/journal/Lieberman_DS_survey_and_guidelines.docx"
    with tempfile.TemporaryDirectory() as tmpdirname:
        # --outdir sends the converted file to the temp directory;
        # the directory and its contents are deleted on exiting the block
        subprocess.run(["soffice", "--convert-to", "pdf", "--outdir", tmpdirname, url], cwd="/")
        files = os.listdir(tmpdirname)
        if files:
            print(files[0])
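Wrapped into the shape the question asks for, this becomes a `generate_pdf(url)` that returns the PDF bytes with no leftover files. A sketch, assuming soffice is installed and the conversion succeeds (the error handling is mine):

```python
import os
import subprocess
import tempfile

def generate_pdf(url):
    # convert into a temp dir, then read back whatever soffice produced;
    # the temp dir (and the PDF in it) is removed when the block exits
    with tempfile.TemporaryDirectory() as tmpdirname:
        subprocess.run(
            ["soffice", "--convert-to", "pdf", "--outdir", tmpdirname, url],
            cwd="/", check=True,
        )
        files = os.listdir(tmpdirname)
        if not files:
            raise RuntimeError("soffice produced no output for " + url)
        with open(os.path.join(tmpdirname, files[0]), "rb") as f:
            return f.read()
```

Because the file is read before the `with` block exits, no `sleep` or explicit `os.remove` is needed, and an unpredictable output name no longer matters.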