I'm working on an API which, when a user uploads a file, processes it on the fly to extract some data from it. The file can be quite large (up to 5 GB) and is not persisted on the server.
I have a bunch of system tests for it which use the standard test client provided by Flask. I also know how to make a test upload a file.
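For context, an upload test with Flask's test client looks roughly like this (the `/upload` endpoint and `my_api` module are placeholders):

```python
import io

from my_api import app  # hypothetical module exposing the Flask app


def test_upload_small_file():
    client = app.test_client()
    # The test client accepts a (stream, filename) tuple for file fields.
    response = client.post(
        "/upload",
        data={"file": (io.BytesIO(b"some,csv,data\n"), "data.csv")},
        content_type="multipart/form-data",
    )
    assert response.status_code == 200
```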
There is, however, a case I haven't covered with the system tests: what if the user starts uploading a file, then drops the connection in the middle, while the server is busy extracting data from what it has already received?
I played with `curl` by starting a large upload and terminating the client in the middle of it. The API seems to behave according to my expectations, but I would prefer a fully automated approach.
How can I do that? Is there something in Flask or Werkzeug which would allow me to perform such a test? If not, what could be a viable approach?
After searching for a while, it seemed that neither Flask's testing facilities nor the Requests library could help me. The solution was therefore to perform a manual HTTP request with Python's sockets.
To save time, the easiest way to emulate the request and study what was actually sent was to use `curl` with a few additional parameters (a combined invocation is sketched after this list):

- `--trace-ascii -` displays everything `curl` sends and receives. Very handy.
- `--limit-rate 3K` helps simulate a very slow connection which, in conjunction with gunicorn's timeout configuration (`gunicorn --timeout 2 ...`), made it possible to reproduce a situation where the client starts POSTing a file and then hangs for too long.
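Put together, the manual experiment looked roughly like this (the URL and file name are placeholders):

```
# Server side: workers give up after 2 seconds of inactivity.
gunicorn --timeout 2 myapp:app

# Client side: throttled multipart upload with full wire-level tracing.
curl --trace-ascii - --limit-rate 3K -F "file=@big.bin" http://localhost:8000/upload
```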
From there, I simply had to replicate, in test code using `socket.socket()`, the behavior `curl` was reporting in the terminal.
The test code then streamed the file to the server from a thread, making regular pauses and reporting its progress to the main thread. When a part of the file, but not all of it, had been sent, the main thread would abruptly terminate the streaming thread, wait long enough for gunicorn to time out, and then query the database to check whether the processing happened as expected.
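A condensed sketch of that test, with placeholder host, endpoint, sizes and timings. Since Python threads cannot actually be killed abruptly, this version stops the sender cooperatively with an event instead:

```python
import socket
import threading
import time

HOST, PORT = "localhost", 8000   # assumed address of the gunicorn server
BOUNDARY = b"----testboundary"
TOTAL_SIZE = 50 * 1024 * 1024    # file size announced in Content-Length: 50 MiB
CHUNK = 64 * 1024


def stream_partial_upload(sock, progress, stop_event):
    """Stream a multipart POST body, pausing between chunks, until stopped."""
    head = (
        b"--" + BOUNDARY + b"\r\n"
        b'Content-Disposition: form-data; name="file"; filename="big.bin"\r\n'
        b"Content-Type: application/octet-stream\r\n\r\n"
    )
    tail = b"\r\n--" + BOUNDARY + b"--\r\n"
    content_length = len(head) + TOTAL_SIZE + len(tail)
    sock.sendall(
        b"POST /upload HTTP/1.1\r\n"
        b"Host: localhost\r\n"
        b"Content-Type: multipart/form-data; boundary=" + BOUNDARY + b"\r\n"
        b"Content-Length: " + str(content_length).encode() + b"\r\n\r\n"
    )
    sock.sendall(head)
    sent = 0
    while sent < TOTAL_SIZE and not stop_event.is_set():
        sock.sendall(b"x" * CHUNK)   # dummy file content
        sent += CHUNK
        progress[0] = sent           # report progress to the main thread
        time.sleep(0.1)              # regular pause between chunks
    # The closing boundary is deliberately never sent when stopped mid-file.


def test_connection_dropped_mid_upload():
    sock = socket.create_connection((HOST, PORT))
    stop_event = threading.Event()
    progress = [0]
    sender = threading.Thread(
        target=stream_partial_upload, args=(sock, progress, stop_event)
    )
    sender.start()
    while progress[0] < TOTAL_SIZE // 2:  # let roughly half the file through
        time.sleep(0.05)
    stop_event.set()
    sender.join()
    sock.close()      # drop the connection abruptly
    time.sleep(3)     # wait long enough for gunicorn (--timeout 2) to give up
    # ...then query the database to check how the partial upload was handled.
```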
One thing I didn't know is that Werkzeug buffers the request, which is not intuitive at all, given that when using `werkzeug.formparser.parse_form_data()`, the `write` method of the stream passed to Werkzeug is called every time a newline character is encountered. The trick is that the input is first buffered, and only once the buffer is full does `write` start to be called for the data in the buffer. Originally I was sending files containing only a few kilobytes, so it looked as if Werkzeug were reading the whole file into memory and only then letting me process it. When I started sending files larger than its 65,536-byte buffer, I noticed that the calls to `write` arrive in bursts each time the buffer fills.
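A small way to observe this behavior, assuming a WSGI `environ` carrying a multipart upload; the `LoggingStream` class is purely illustrative and only records how `write` is called:

```python
from werkzeug.formparser import parse_form_data


class LoggingStream:
    """Minimal writable file-like object that records each write() call."""

    def __init__(self):
        self.sizes = []

    def write(self, data):
        self.sizes.append(len(data))  # one entry per call made by Werkzeug
        return len(data)

    def seek(self, pos, whence=0):    # Werkzeug rewinds the stream when done
        return pos


def stream_factory(total_content_length, content_type, filename,
                   content_length=None):
    return LoggingStream()


def handle_upload(environ):
    stream, form, files = parse_form_data(environ, stream_factory=stream_factory)
    # For files below the ~64 KiB buffer, the recorded calls all arrive only
    # after the whole request was consumed; above it, they come in bursts
    # each time the buffer fills.
    return files["file"].stream.sizes
```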