Tags: ruby, multithreading, concurrency, truecrypt

Ruby: CPU-load degradation of concurrent/multithreaded task?


Preamble: I am working on a project to restore a TrueCrypt container. It was cut into more than 3M small files, most likely in random order, and the goal is to find either the beginning or the end of the container, which contains the encryption keys.

To do so I've written a small Ruby script that starts many truecrypt processes concurrently, attempting to mount the main header or restore the backup header. Interaction with truecrypt occurs through spawned PTYs:

  PTY.spawn(@cmd) do |stdout, stdin, pid|
    @spawn = {stdout: stdout, stdin: stdin, pid: pid}

    # Walk truecrypt's prompts in the direction we're testing
    if test_type == :forward
      process_truecrypt_forward
    else
      process_truecrypt_backward
    end

    # Send a newline, then wait for the expected 'Incorrect password' reply
    stdin.puts
    pty_expect('Incorrect password')

    # Stop truecrypt and clean up the PTY
    Process.kill('INT', pid)
    stdin.close
    stdout.close
    Process.wait(pid)
  end
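
For reference, pty_expect is a small helper; a minimal version might wrap IO#expect from Ruby's expect stdlib, roughly like this (a hypothetical sketch, since the real implementation isn't shown here):

  require 'expect'  # adds IO#expect for scripted interaction with the PTY

  # Hypothetical implementation of the pty_expect helper: block until the
  # given text appears on the truecrypt PTY, or fail after `timeout` seconds.
  def pty_expect(pattern, timeout = 10)
    result = @spawn[:stdout].expect(pattern, timeout)
    raise "timed out waiting for #{pattern.inspect}" if result.nil?
    result
  end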

This all works fine and successfully finds the required pieces of a test container. To speed things up (I need to process over 3M pieces) I first used Ruby MRI multithreading and, after reading about its problems, switched to concurrent-ruby.

My implementation is pretty straightforward:

log 'Starting DB test'
concurrent_db = Concurrent::Array.new(@db)

futures = []

progress_bar = initialize_progress_bar('Running DB test', concurrent_db.size)

MAXIMUM_FUTURES.times do
  log "Started new future, total #{futures.size} futures"

  futures << Concurrent::Future.execute do
    my_piece = nil

    run = 1

    until concurrent_db.empty?
      # Atomically grab the next batch of pieces from the shared array
      my_piece = concurrent_db.slice!(0, SLICE_PER_FUTURE)
      break unless my_piece
      log "Run #{run}, sliced #{my_piece.size} pieces, #{concurrent_db.size} left"

      my_piece.each {|a| run_single_test(a)}
      progress_bar.progress += my_piece.size
      run += 1
    end

    log 'Future finished'
  end
end

Then I rented a large AWS instance with 74 CPU cores and thought: "now I'm going to process it fast". But the problem is that no matter how many futures/threads (whether 20 or 1000) I launch simultaneously, I never get above ~50 checks/second.

When I launch 1000 threads, the CPU load stays at 100% for only 20-30 minutes, then drops until it reaches around 15%, and it stays there (see the graph of typical CPU load within such a run). Disk load is not an issue: I am hitting 3MiB/s at maximum, using Amazon EBS storage.

What am I missing? Why can't I utilize 100% CPU and achieve better performance?


Solution

  • It's hard to say why exactly you aren't seeing the benefits of multithreading. But here's my guess.

    Let's say you have a really intensive Ruby method called do_work that takes 10 seconds to run. And, even worse, you need to run it 100 times. Rather than wait 1000 seconds, you might try to multithread it. That could divide the work among your CPU cores, halving or maybe even quartering the runtime:

    Array.new(100) { Thread.new { do_work } }.each(&:join)
    

    But no, this is probably still going to take 1000 seconds to finish. Why?

    The Global VM Lock

    Consider this example:

    thread1 = Thread.new { class Foo; end; Foo.new }
    thread2 = Thread.new { class Foo; end; Foo.new }
    

    Creating a class in Ruby does a lot of stuff under the hood: for example, it has to create the actual class object and assign that object's pointer to a global constant (in some order). What happens if thread1 registers the global constant, gets halfway through creating the actual class object, and then thread2 starts running, sees that Foo already exists, and goes ahead and runs Foo.new? What happens, given that the class hasn't been fully defined? Or what if both thread1 and thread2 create a new class object and then both try to register theirs as Foo? Which one wins? What about the class object that was created but now doesn't get registered?

    The official Ruby solution for this is simple: don't actually run this code in parallel. Instead, there is one single, massive mutex called "the global VM lock" that protects anything that modifies the Ruby VM's state (such as making a class). So while the two threads above may be interleaved in various ways, it's impossible for the VM to end up in an invalid state because each VM operation is essentially atomic.

    Example

    This takes about 6 seconds to run on my laptop:

    def do_work
      Array.new(100000000) { |i| i * i }
    end
    

    This, obviously, takes about 18 seconds:

    3.times { do_work }
    

    But this also takes about 18 seconds, because the GVL prevents the threads from actually running in parallel:

    Array.new(3) { Thread.new { do_work } }.each(&:join)
    

    This also takes about 6 seconds to run:

    def do_work2
      sleep 6
    end
    

    But now this also takes about 6 seconds to run:

    Array.new(3) { Thread.new { do_work2 } }.each(&:join)
    

    Why? If you dig through the Ruby source code, you'll find that sleep ultimately calls the C function native_sleep, and in there we see:

    GVL_UNLOCK_BEGIN(th);
    {
        //...
    }
    GVL_UNLOCK_END(th);
    

    The Ruby devs know that sleep doesn't affect the VM state, so they explicitly unlock the GVL to allow it to run in parallel. It can be tricky to figure out exactly what locks/unlocks the GVL, and when you're going to see a performance benefit from threading.
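
    One thing that reliably sidesteps the GVL is a separate process: each forked child runs its own Ruby VM with its own lock. As a sketch (assuming a Unix platform, reusing do_work from above):

    # Three forked copies of the CPU-bound do_work run truly in parallel,
    # so this finishes in roughly 6 seconds on a multi-core machine, not 18:
    pids = Array.new(3) { Process.fork { do_work } }
    pids.each { |pid| Process.wait(pid) }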

    How to fix your code

    My guess is that something in your code is hitting the GVL, so while some parts of your threads are running in parallel (generally, any subprocess/PTY work does), there's still contention between them in the Ruby VM, causing some parts to serialize.

    Your best bet for getting truly parallel Ruby code is to simplify it to something like this:

    Array.new(x) { Thread.new { do_work } }
    

    where you're sure that do_work is something simple that definitely unlocks the GVL, such as spawning a subprocess. You could try moving your TrueCrypt interaction into a little shell script so that Ruby doesn't have to talk to it once it gets going.
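
    For example, here's a sketch of that approach with plain threads and a work queue, assuming a hypothetical check_piece.sh wrapper script around truecrypt and a pieces array of file paths; each thread spends nearly all of its time blocked in system, which waits on the subprocess with the GVL released:

    queue = Queue.new                        # thread-safe queue from the stdlib
    pieces.each { |piece| queue << piece }   # `pieces` assumed to hold the file paths

    workers = Array.new(32) do
      Thread.new do
        loop do
          piece = begin
            queue.pop(true)                  # non-blocking pop
          rescue ThreadError
            break                            # queue drained, this worker is done
          end
          system('./check_piece.sh', piece)  # hypothetical wrapper around truecrypt
        end
      end
    end
    workers.each(&:join)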

    I recommend starting with a little benchmark that just launches a few subprocesses and checks that they actually run in parallel, by comparing the time against running them serially.
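
    Something like this sketch would do, using a shell sleep as a stand-in for the real command:

    require 'benchmark'

    n = 8  # number of stand-in subprocesses

    serial = Benchmark.realtime { n.times { system('sleep 1') } }
    threaded = Benchmark.realtime do
      Array.new(n) { Thread.new { system('sleep 1') } }.each(&:join)
    end

    puts "serial: #{serial.round(1)}s, threaded: #{threaded.round(1)}s"
    # If the subprocesses really run in parallel, the serial run should take
    # about n seconds and the threaded run about 1 second.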