ruby-on-rails-3, multithreading, curb

Rails - loop of curl requests eating memory


I use the Curb gem (I also tried HTTParty) to perform a lot of HTTP requests, and this works fine. But in one of my rake tasks (where I do 20k+ requests) I have a memory problem: Rails "eats" more than 2 GB of RAM until there is no free memory left.

It seems that Rails "doesn't wait" for the response and moves on to the next iteration of the loop in another thread; the problem is that this way a lot of objects get created that are never collected by the garbage collector (I think), and that is the reason for the memory leak.

Is there a way to tell Rails to wait until the response has arrived? (I tried with sleep, but it is not a stable solution.)

My pseudocode looks like this:

def the_start
  while start_date <= end_date do              # ~ 140 iterations
    a_method_that_do_sub_specifics_call
  end
end

def a_method_that_do_sub_specifics_call
  some_data.each do |r|                        # ~ 180 iterations
    do_a_call
    # do something with models (update/create entries, ...)
  end
end

def do_a_call                                  # called ~ 25k times
  # earlier version, with the Curb gem
  req = Curl::Easy.new do |curl|
    curl.ssl_verify_peer = false
    curl.url = url
    curl.headers['Content-type'] = 'application/json'
  end
  req.perform

  # current version, with the HTTParty gem
  req = HTTParty.get(url,
    :headers => { 'Content-type' => 'application/json' })
end

It seems that Rails doesn't wait for the result of req.perform.

EDIT:
I also tried instantiating the Curl::Easy object only once, using Curl::Easy.perform() and req.close (which should implicitly trigger the GC) after the call, but without success: memory usage is still huge. The only solution that (I think) can work is to "block" Rails until the response has arrived, but how?
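
Roughly, the single-handle attempt looked like this (a sketch, not the exact task code; urls stands in for the real collection of request URLs, and reusing the handle by reassigning curl.url is an assumption about how Curb can be driven):

require 'curb'

handle = Curl::Easy.new do |curl|
  curl.ssl_verify_peer = false
  curl.headers['Content-type'] = 'application/json'
end

urls.each do |url|
  handle.url = url
  handle.perform                   # blocks until libcurl returns the response
  body = handle.body_str           # response body for this request
  # ... process body ...
end

handle.close                       # release the libcurl handle when finished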

EDIT 2:
In another task I call only a_method_that_do_sub_specifics_call, without any problem.

EDIT 3:
After some performance tweaks (adding find_each(:batch_size => ...), GC.start, ...) the task works a little better: now the first ~100 iterations of do_a_call work fine, but after that memory usage jumps from 100 MB to 2 GB+ again.


Solution

  • After days of debugging and reading tons of forums and posts, I found the culprit:
    a modest class-variable string that kept growing until it caused a memory leak.

    Some useful notes that I picked up along the way:

    Curb vs HTTParty
    Of these two gems for performing curl requests, the better one in terms of performance is Curb: http://bibwild.wordpress.com/2012/04/30/ruby-http-performance-shootout-redux/

    Pay attention to class variables
    My problem was a debug/info string variable on the class that kept growing; avoid class variables, since they are never collected by the garbage collector. In my specific case it was:

    @status = "#{@status} Warning - response is empty for #{description}\n"
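
    A minimal sketch of the fix, assuming the standard Rails logger is available (description is the same variable as in the snippet above): emit each message instead of accumulating it in a long-lived string, so the garbage collector can reclaim it.

    # log the warning instead of appending it to an ever-growing string
    Rails.logger.warn("Response is empty for #{description}")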
    

    Perform some manual garbage collection
    Call GC.start at the critical points to make sure memory that is no longer needed gets freed. Remember that calling GC.start doesn't trigger the garbage collector instantly; it only suggests a collection.
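
    For example, in the loop structure from the question a collection can be suggested after each batch of calls (a sketch; where exactly it pays off depends on the workload):

    def the_start
      while start_date <= end_date do
        a_method_that_do_sub_specifics_call
        GC.start              # suggest a collection after each ~180-request batch
      end
    end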

    Loading large ActiveRecord collections
    When loading a large number of ActiveRecord rows, use .find_each, e.g.:

    Model.find_each(:batch_size => 50) do |row|
      # process the row
    end

    This runs a query for only 50 rows at a time (or whatever value smaller than the default you choose), which is better than a single query that loads 1k rows. (I believe the default batch_size is 1000.)

    Useful links: