I need to fetch .doc or .docx files from external sites, convert them to PDF, and return the content. I then add a Content-Type header, publish through my CMS, cache on a CDN, and display the PDF within HTML using the Adobe PDF Embed API. I'm using Python 3.7.
As a test, this works:
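import os
import subprocess
from time import sleep

def generate_pdf():
    # soffice writes anyfilename.pdf into the current working directory
    subprocess.call(['soffice', '--convert-to', 'pdf',
                     'https://arbitrary.othersite.com/anyfilename.docx'])
    sleep(1)
    with open('anyfilename.pdf', 'rb') as myfile:
        content = myfile.read()
    os.remove('anyfilename.pdf')
    return content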
This would be nice:
def generate_pdf(url):
    # wished-for behaviour: in reality subprocess.call returns the exit code, not the PDF bytes
    result = subprocess.call(['soffice', '--convert-to', 'pdf', url])
    content = result
    return content
The URLs could include any parameters or illegal characters, which might make it hard to guess the resulting file name. Anyway, it would be preferable not to have to sleep, save, read, and delete the converted file.
Is this possible?
I don't think soffice supports writing to stdout, so you don't have many choices. If you write the output to a temporary directory, though, you can use os.listdir to get the file name:
import subprocess
import tempfile
import os

url = "https://www.usariem.army.mil/assets/docs/journal/Lieberman_DS_survey_and_guidelines.docx"
with tempfile.TemporaryDirectory() as tmpdirname:
    subprocess.run(["soffice", '--convert-to', 'pdf', "--outdir", tmpdirname, url], cwd="/")
    files = os.listdir(tmpdirname)
    if files:
        print(files[0])
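To get closer to the generate_pdf(url) shape from the question, the same idea could be wrapped roughly as below. This is only a sketch: read_only_file is a hypothetical helper name, the soffice and timeout parameters are my own additions, and I am assuming soffice leaves exactly one file in the output directory per conversion. It has no sleep (subprocess.run waits for the process to exit) and no manual delete (the temporary directory is removed when the with block ends).

```python
import os
import subprocess
import tempfile


def read_only_file(directory):
    """Return the bytes of the single file in `directory` (hypothetical helper)."""
    names = os.listdir(directory)
    if len(names) != 1:
        # Guard in case the conversion produced nothing, or more than expected.
        raise RuntimeError("expected one converted file, found: %r" % (names,))
    with open(os.path.join(directory, names[0]), "rb") as f:
        return f.read()


def generate_pdf(url, soffice="soffice", timeout=120):
    """Convert the document at `url` to PDF and return the PDF bytes.

    `soffice` and `timeout` are illustrative parameters, not part of the
    original snippet.
    """
    with tempfile.TemporaryDirectory() as tmpdirname:
        # subprocess.run blocks until soffice exits, so no sleep is needed.
        # check=True raises CalledProcessError on a non-zero exit code.
        # cwd="/" is carried over unchanged from the snippet above.
        subprocess.run(
            [soffice, "--convert-to", "pdf", "--outdir", tmpdirname, url],
            cwd="/",
            check=True,
            timeout=timeout,
        )
        # The file name never has to be guessed: read whatever is there
        # before the directory is cleaned up.
        return read_only_file(tmpdirname)
```

Usage would then be content = generate_pdf(url), with content being the raw bytes to send with an application/pdf Content-Type header. Because the file is read inside the with block, odd characters or query parameters in the URL do not matter for locating the result.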