Tags: php, ruby, asynchronous, rack, evented-io

Lightweight streaming HTTP proxy for Rack (Ruby CPU-light HTTP client library)


So I am experimenting with a situation where I want to stream huge files from a third-party URL, through my server, to the requesting client.

So far I have tried implementing this with Curb or Net::HTTP by adhering to the standard Rack practice of "eachable" response bodies, like so:

class StreamBody
  ...
  def each
    some_http_library.on_body do | body_chunk |
      yield(body_chunk)
    end
  end
end
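
For reference, here is a fleshed-out sketch of that pattern using Net::HTTP from the standard library (the class name, URL and headers are illustrative, not my exact code):

require 'net/http'
require 'uri'

class StreamBody
  def initialize(url)
    @uri = URI(url)
  end

  # Rack calls each and writes every yielded chunk out to the client
  def each
    Net::HTTP.start(@uri.host, @uri.port, use_ssl: @uri.scheme == 'https') do |http|
      http.request(Net::HTTP::Get.new(@uri)) do |response|
        # read_body with a block streams the body as it arrives instead of buffering it
        response.read_body { |chunk| yield(chunk) }
      end
    end
  end
end

# config.ru
run ->(env) { [200, {'Content-Type' => 'application/octet-stream'}, StreamBody.new('http://example.com/huge.bin')] }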

However, I cannot make this setup use less than about 40% CPU (on my MacBook Air). If I try to do the same with Goliath, using em-synchrony (as advised on the Goliath page), I can get the CPU usage down to about 25%, but then I cannot manage to flush the headers. My streaming download "hangs" in the requesting client and the headers show up only once the entire response has been sent to the client, no matter what headers I supply.

Am I correct in thinking that this is one of those cases where Ruby just sucks marvelously and I have to turn to the go's and nodejs'es of the world instead?

By comparison, we currently use PHP streaming from CURL to the PHP output stream and that works with very little CPU overhead.

Or is there an upstream proxying solution that I could ask to handle my stuff? Problem is - I want to reliably call a Ruby function once the entire body has been sent to the socket, and things like nginx proxies will not do it for me.

UPDATE: I have tried to do a simple benchmark of HTTP clients, and it looks like most of the CPU use comes from the HTTP client libs themselves. There are benchmarks for Ruby HTTP clients, but they measure response receive times, and CPU usage is never mentioned. In my test I performed a streamed HTTP download, writing the result to /dev/null, and got consistent 30-40% CPU usage, which roughly matches the CPU usage I have when streaming through any Rack handler.
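
For reproducibility, a rough sketch of that kind of measurement (Net::HTTP shown here; the URL is illustrative, and Process.times captures CPU rather than wall time):

require 'net/http'
require 'uri'

uri = URI('http://example.com/huge.bin')
cpu_before  = Process.times
wall_before = Time.now

# Stream the download straight into /dev/null so we only measure client overhead
File.open('/dev/null', 'wb') do |devnull|
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      response.read_body { |chunk| devnull.write(chunk) }
    end
  end
end

cpu_after = Process.times
cpu  = (cpu_after.utime - cpu_before.utime) + (cpu_after.stime - cpu_before.stime)
wall = Time.now - wall_before
puts format('%.2fs CPU over %.2fs wall clock (%.0f%% CPU)', cpu, wall, 100 * cpu / wall)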

UPDATE: It turns out that most Rack handlers (Unicorn etc.) use a write() loop on the response body, which might enter a busy wait (with high CPU load) when the response cannot be written fast enough. This can be mitigated to a degree by using rack.hijack and writing to the output socket using write_nonblock and IO.select (surprised the servers do not do that by themselves).

lambda do |socket|
  begin
    rack_response_body.each do | chunk |
      begin
        bytes_written = socket.write_nonblock(chunk)
        # If we could write only partially, make sure we do a retry on the next
        # iteration with the remaining part
        if bytes_written < chunk.bytesize
          chunk = chunk.byteslice(bytes_written..-1) # slice by bytes, not characters
          raise Errno::EINTR
        end
      rescue IO::WaitWritable, Errno::EINTR # The output socket is saturated.
        IO.select(nil, [socket]) # Then let's wait on the socket to be writable again
        retry # and off we go...
      rescue Errno::EPIPE, Errno::ECONNRESET # Happens when the client aborts the connection
        return
      end
    end
  ensure
    socket.close rescue nil # the client may already be gone
    rack_response_body.close if rack_response_body.respond_to?(:close)
  end
end
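
For completeness, this is roughly how such a lambda can be wired in via partial hijack, assuming a server that advertises hijack support through env['rack.hijack?'] per the Rack spec (rack_response_body is the eachable body from the snippet above):

# config.ru - partial hijack: the server writes the status and headers itself,
# then calls our lambda with the raw client socket
run lambda { |env|
  writer = lambda do |socket|
    # ... the write_nonblock / IO.select loop shown above ...
  end
  if env['rack.hijack?']
    [200, {'Content-Type' => 'application/octet-stream', 'rack.hijack' => writer}, []]
  else
    # Fall back to a plain streaming body on servers without hijack support
    [200, {'Content-Type' => 'application/octet-stream'}, rack_response_body]
  end
}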

Solution

  • There were no answers, but in the end we did manage to find a solution. It is remarkably successful: we are pumping terabytes of data through it on a daily basis. Here are the key ingredients:

    The main problem with a desire to build something like this in Ruby is what I call string churn. Basically, allocating strings in the VM is not free. When you are pushing lots of data through, you will end up allocating a Ruby String per chunk of data received from the upstream source, and possibly you will also end up allocating strings if you are unable to write() an entire chunk to the socket that represents your client connected over TCP. Of all the approaches we tried, we were unable to find a solution that would let us avoid string churn - before we stumbled on Patron, that is.
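
    To see the churn in numbers, you can sample MRI's allocation counters around the download; a rough sketch reusing the placeholder client and a hypothetical sink from the question (GC.stat fields are MRI-specific):

    allocated_before = GC.stat(:total_allocated_objects)

    # Every chunk the client yields is a freshly allocated Ruby String
    some_http_library.on_body do |body_chunk|
      sink.write(body_chunk) # plus extra copies if the write is partial
    end

    puts GC.stat(:total_allocated_objects) - allocated_before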

    Patron, as it turns out, is the only Ruby HTTP client that allows direct-to-file writes in userspace. This means that you can download data over HTTP without allocating a Ruby String for the data that you pull. Patron has a function that opens a FILE* pointer and writes straight to it from libcurl callbacks. This happens while the Ruby GVL is unlocked, since everything is folded into the C level. In practice this means that at the "pull" stage nothing will be allocated in the Ruby heap to store the response body.

    Note that curb, the other widely-used CURL binding library, does not have that feature - it will allocate Ruby strings on the heap and yield them to you, which defeats the purpose.

    The next step is serving that content to the TCP socket. As it happens - again - there are three ways to do it: read the downloaded file back into the Ruby heap and write() it to the socket (which brings the string churn right back), push the writes down into C so the data bypasses the Ruby heap, or use the sendfile() syscall and let the kernel move the data from the file to the socket directly.

    Either way, you need to get at the TCP socket - so you need either full or partial Rack hijack support (check your webserver's documentation for whether it has it).

    We decided to go with the third option. sendfile is a wonderful gem by the author of Unicorn and Rainbows!, and it accomplishes just that: give it a Ruby File object and a TCPSocket, and it will ask the kernel to send the file to the socket, bypassing as much machinery as possible. Again, you do not have to read anything into the heap. So, in the end, here is the approach we went for (pseudo-code-ish, does not handle edge cases):

    require 'tempfile'
    require 'patron'
    require 'sendfile' # adds IO#sendfile

    # Use Tempfile to allocate a unique file name
    tf = Tempfile.new('chunk')
    
    # Download a part of the file using the Range header 
    Patron::Session.new.get_file(the_url, tf.path, {'Range' => 'bytes=..-..'})
    
    # Use the blocking sendfile call (for demo purposes; you can also send in chunks).
    # Note that non-blocking sendfile() is broken on OSX
    socket.sendfile(tf, 0, tf.size) # (source IO, offset, number of bytes)
    
    # Make sure to get rid of the file
    tf.close; tf.unlink
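
    Since the comment above mentions sending in chunks, here is a sketch of that variant, assuming IO#sendfile from the same gem returns the number of bytes it wrote:

    # Chunked variant: cap each sendfile() call so a single slow client
    # does not keep the thread inside one huge blocking syscall
    chunk_size = 1024 * 1024
    offset = 0
    while offset < tf.size
      offset += socket.sendfile(tf, offset, [chunk_size, tf.size - offset].min)
    end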
    

    This allows us to service multiple connections, without eventing, with very little CPU load and very little heap pressure. We routinely see boxes serving hundreds of users at about 2% CPU while doing so. And the Ruby GC stays happy. Essentially, the only thing we do not like about this implementation is the 8MB-per-thread RAM overhead imposed by MRI. However, to work around that we would need to switch to an evented server (spaghetti code galore) or write our own IO reactor that would multiplex a large number of connections onto a much smaller pool of threads, which is certainly doable but would take too much time.

    Hopefully this will help someone.