I want to let a user submit an MS Word file to my app, process it with the python-docx library, and return it. Since the file may be large, I do not want to save it to the file system after processing, but rather return it directly for download.
Get file from stream - this works
import docx
from docx.document import Document
from StringIO import StringIO
source_stream = StringIO(request.vars['file'].value)
document = docx.Document(source_stream)
source_stream.close()
process_doc(document)
Return it as a stream - this does not work
The app does prompt the user to download the file, but MS Word can't open it, saying it cannot be opened "because some part is missing or invalid".
def download(document, filename):
    import contenttype as c
    import cStringIO
    out_stream = cStringIO.StringIO()
    document.save(out_stream)
    response.headers['Content-Type'] = c.contenttype(filename)
    response.headers['Content-Disposition'] = \
        "attachment; filename=%s" % filename
    return out_stream.getvalue()
I've found "Upload a StringIO object with send_file()", but that pertains to the Flask framework. I would rather use the web2py framework.
Some have suggested moving the file pointer to the start of the document data before sending it in the output stream. But how do I do that?
As @scanny suggested, I created an empty document with docx.Document() and served it for download from a file object using the BytesIO class:
import os
from io import BytesIO

document = docx.Document()
out_stream = BytesIO()
document.save(out_stream)
filename = 'temporal_file.docx'
filepath = os.path.join(request.folder, 'uploads', filename)
try:
    # also write a copy to disk so it can be checked independently
    with open(filepath, 'wb') as f:
        f.write(out_stream.getvalue())
    response.flash = 'Success to open file for writing'
    response.headers['Content-Disposition'] = "attachment; filename=%s" % filename
    response.headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    #response['X-Sendfile'] = filepath
    #response['Content-Length'] = os.stat(filepath).st_size
    return out_stream.getvalue()
except Exception as e:
    response.flash = 'Error open file for writing or download it - ' + filename
    return
As seen in the code, I also write that empty file to the file system. I could easily download that copy manually and open it in MS Word.
So the question remains open: why is the MS Word file downloaded through the output stream damaged, so that MS Word cannot open it?
I've eliminated python-docx from the process of writing the file to the output stream. The result was the same: after downloading, the file can't be opened in MS Word. Code:
# we load without python-docx library
import os
from io import BytesIO
try:
    filename = 'empty_file.docx'
    filepath = os.path.join(request.folder, 'uploads', filename)
    # read a file from file system (disk)
    with open(filepath, 'rb') as f:
        out_stream = BytesIO(f.read())
    response.flash = 'Success to open file for reading'
    response.headers['Content-Disposition'] = "attachment; filename=%s" % filename
    response.headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
    return out_stream.getvalue()
except Exception as e:
    response.flash = 'Error open file for reading or download it - ' + filename
    return
I would start by saving to the file-like object and then copying that file-like object to a file (locally, without downloading it). That should bisect the range of where the problem is happening. By the way, I'd use BytesIO instead of StringIO. It might not make a difference in 2.7, but it could, and the StringIO and cStringIO modules don't exist in Python 3 in any case:
from io import BytesIO
# ... code that processes `document`
out_stream = BytesIO()
document.save(out_stream)
with open('test.docx', 'wb') as f:
    f.write(out_stream.getvalue())
If that doesn't work (test.docx won't open), you've narrowed the problem to "before" the document.save() call.
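A rough way to run the same check without opening Word each time: a .docx file is a ZIP archive, so the bytes can be tested with the standard-library zipfile module. This is only a sketch. looks_like_docx is a hypothetical helper name, the check for word/document.xml relies on the conventional location of the main document part, and the sample bytes are a stand-in for out_stream.getvalue(), not a real document.

```python
import zipfile
from io import BytesIO


def looks_like_docx(data):
    """Return True if `data` is an intact ZIP archive holding word/document.xml."""
    stream = BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return False
    with zipfile.ZipFile(stream) as archive:
        # testzip() returns the name of the first corrupt member, or None
        if archive.testzip() is not None:
            return False
        return "word/document.xml" in archive.namelist()


# Stand-in for out_stream.getvalue(): a tiny ZIP with the expected member.
sample = BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
good_bytes = sample.getvalue()

# Simulate damage in transit by cutting the payload in half.
truncated_bytes = good_bytes[: len(good_bytes) // 2]

print(looks_like_docx(good_bytes))
print(looks_like_docx(truncated_bytes))
```

Running the same function on the bytes written locally and on the file the browser actually downloaded would show on which side of the HTTP response the damage occurs.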
If it does work, you can try the download again and see, but pay particular attention to the type expected as the return value from your download method. What you'd be getting here is a sequence of bytes. If it's expecting a file-like object or perhaps a path, that could be the problem too.
Moving the file pointer to the start (using out_stream.seek(0)) would only be relevant if you were returning a file-like object, like return out_stream instead of return out_stream.getvalue(). The latter returns bytes, which of course don't have a file pointer. BytesIO.getvalue() (or StringIO.getvalue()) does not require setting the file cursor; it always returns the full contents of the object.
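The difference between reading from the cursor and calling getvalue() can be seen in a few lines. This is a minimal illustration using only io.BytesIO; the payload is a made-up placeholder, not real document data.

```python
from io import BytesIO

out_stream = BytesIO()
out_stream.write(b"PK fake docx payload")  # the cursor is now at the end

# read() starts at the cursor, so nothing is left to read here
at_end = out_stream.read()

# getvalue() ignores the cursor and returns the whole buffer
everything = out_stream.getvalue()

# after rewinding, read() sees the full contents as well
out_stream.seek(0)
after_rewind = out_stream.read()

print(len(at_end), len(everything), len(after_rewind))
```

So seek(0) matters only when the framework itself will call read() on the object you hand back.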
Also, instead of relying on contenttype to get it right, I'd spell out the content-type header as: application/vnd.openxmlformats-officedocument.wordprocessingml.document. If contenttype misidentified the file as a .doc format (pre-Word 2007) file as opposed to a .docx format (Word 2007 and later) file, that could also cause a problem.
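Spelling the headers out could look something like the sketch below. docx_download_headers is a hypothetical helper, not part of web2py or python-docx; it just builds a plain dict in the same format the question already uses for Content-Disposition.

```python
# MIME type for .docx (Word 2007 and later)
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
# MIME type for the older binary .doc format, shown only for contrast
DOC_MIME = "application/msword"


def docx_download_headers(filename):
    """Build download headers for a .docx file (hypothetical helper)."""
    return {
        "Content-Type": DOCX_MIME,
        "Content-Disposition": "attachment; filename=%s" % filename,
    }


headers = docx_download_headers("report.docx")
for name, value in sorted(headers.items()):
    print("%s: %s" % (name, value))
```

In a web2py controller these could presumably be applied with response.headers.update(docx_download_headers(filename)), assuming response.headers behaves like a dict as it does in the snippets above.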