
Python get file from an app on-the-fly (without saving it in file system)


I want to let a user submit an MS Word file to my app, process it with the python-docx library, and return it. Since the file might be big, I do not want to save it to the file system after processing, but rather return it directly for download.

Get file from stream - this works

import docx
from docx.document import Document 
from StringIO import StringIO

source_stream = StringIO(request.vars['file'].value)
document = docx.Document(source_stream)
source_stream.close()
process_doc(document)

Return it as a stream - this does not work

The app does prompt the user to download the file, but MS Word can't open it, reporting that "some part is missing or invalid".

def download(document, filename):
    import contenttype as c
    import cStringIO
    out_stream = cStringIO.StringIO()
    document.save(out_stream)  

    response.headers['Content-Type'] = c.contenttype(filename)
    response.headers['Content-Disposition'] = \
            "attachment; filename=%s" %  filename
    return out_stream.getvalue()

I've found "Upload a StringIO object with send_file()", but that pertains to the Flask framework. I would rather use the web2py framework.

Update 1

Some have suggested moving the file pointer to the start of the document data before sending it to the output stream. But how do I do that?

Update 2

As @scanny has suggested, I've created an empty file,

document = docx.Document()

and offered it for download from a file-like object using BytesIO:

import os
from io import BytesIO

import docx

document = docx.Document()
out_stream = BytesIO()
document.save(out_stream)
filename = 'temporal_file.docx'
filepath = os.path.join(request.folder, 'uploads', filename)
try:
    # also write a copy to disk so it can be compared with the download
    with open(filepath, 'wb') as f:
        f.write(out_stream.getvalue())
    response.flash = 'Success to open file for writing'
    response.headers['Content-Disposition'] = "attachment; filename=%s" % filename
    response.headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    #response['X-Sendfile'] = filepath
    #response['Content-Length'] = os.stat(filepath).st_size
    return out_stream.getvalue()
except IOError:
    response.flash = 'Error writing file - ' + filename

As the code shows, I also write that empty file to the file system. I could then manually download that copy and open it in MS Word without problems.

So the question remains: why is the MS Word file downloaded through the output stream damaged, so that MS Word cannot open it?

Update 3

I've removed python-docx from the process of writing the file to the output stream. The result is the same: after downloading the file, MS Word can't open it. Code:

# we load without python-docx library
from io import BytesIO
try:
    filename = 'empty_file.docx'
    filepath = os.path.join(request.folder, 'uploads',filename )
    # read a file from file system (disk)
    with open(filepath, 'rb') as f: 
        out_stream = BytesIO(f.read())
    response.flash ='Success to open file for reading'
    response.headers['Content-Disposition'] = "attachment; filename=%s" % filename
    response.headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    return out_stream.getvalue()
except Exception as e:
    response.flash ='Error open file for reading or download it - ' + filename
return

Solution

  • I would start by saving to the file-like object and then copying that file-like object to a file (locally, without downloading it). That should bisect the range of where the problem is happening. By the way, I'd use BytesIO instead of StringIO. It might not make a difference in 2.7, but it could, and the StringIO module used in the question doesn't exist in Python 3 in any case:

    from io import BytesIO
    
    # ... code that processes `document`
    out_stream = BytesIO()
    document.save(out_stream)
    with open('test.docx', 'wb') as f:
        f.write(out_stream.getvalue())
    

    If that doesn't work (test.docx won't open), you've narrowed the problem to "before" the document.save() call.

    If it does work, you can try the download again and see, but pay particular attention to the type expected as the return value from your download method. What you'd be getting here is a sequence of bytes. If it's expecting a file-like object or perhaps a path, that could be the problem too.

    Moving the file pointer to the start (using out_stream.seek(0)) would only be relevant if you were returning a file-like object, that is, return out_stream instead of return out_stream.getvalue(). The latter returns bytes, which of course don't have a file pointer. BytesIO.getvalue() (like StringIO.getvalue()) does not require setting the file cursor; it always returns the full contents of the object.

    Also, instead of relying on contenttype to get it right, I'd spell out the content-type header as: application/vnd.openxmlformats-officedocument.wordprocessingml.document. If contenttype misidentified the file as a .doc format (pre-Word 2007) file as opposed to a .docx format (Word 2007 and later) file, that could also cause a problem.
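
The points above about getvalue(), seek(0), and the explicit content type can be pulled together in a small sketch. It is not web2py or python-docx code: `build_download` and `fake_save` are hypothetical names introduced here for illustration, and `fake_save` only stands in for `document.save` so the example runs without python-docx installed.

```python
from io import BytesIO

# Content type for .docx files, spelled out rather than guessed.
DOCX_CONTENT_TYPE = (
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document"
)


def build_download(save, filename):
    """Serialize a document in memory and prepare a download response.

    `save` is any callable that writes bytes to a file-like object,
    for example the bound method `document.save` of a python-docx
    document. Returns a (headers, body) pair. Hypothetical helper.
    """
    out_stream = BytesIO()
    save(out_stream)
    headers = {
        "Content-Type": DOCX_CONTENT_TYPE,
        "Content-Disposition": "attachment; filename=%s" % filename,
    }
    # getvalue() returns the whole buffer, wherever the file pointer is.
    return headers, out_stream.getvalue()


def fake_save(stream):
    # Stand-in for document.save(stream); writes placeholder bytes only,
    # not a real .docx package.
    stream.write(b"PK\x03\x04 placeholder payload")


headers, body = build_download(fake_save, "report.docx")

# The file pointer only matters when reading from the stream itself.
buf = BytesIO()
fake_save(buf)
read_at_end = buf.read()      # pointer sits at the end after writing
buf.seek(0)                   # rewind before handing out the stream
read_from_start = buf.read()  # now the full contents are read
```

In a web2py controller, one would presumably copy `headers` into `response.headers` and return `body`, as the question's own code does. The seek(0) step is only needed if the stream object itself, rather than its bytes, is handed to whatever sends the response.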