Preamble: I am working on a project to restore truecrypt container. It was cut to more than 3M small files in most likely random order and the goal is to find either the beginning or the ending of the container containing the encryption keys.
To do so I’ve written a small ruby script that starts many truecrypt processes concurrently attempting to mount the main or restore the backup headers. Interaction with truecrypt occures through spawned PTYs:
PTY.spawn(@cmd) do |stdout, stdin, pid|
@spawn = {stdout: stdout, stdin: stdin, pid: pid}
if test_type == :forward
process_truecrypt_forward
else
process_truecrypt_backward
end
stdin.puts
pty_expect('Incorrect password')
Process.kill('INT', pid)
stdin.close
stdout.close
Process.wait(pid)
end
This all works fine and successfully finds required pieces of a test container. To speed things up (and I need to proccess over 3M pieces) I've first used Ruby MRI multithreading and after reading about problems with it switched to concurent-ruby.
My implementation is pretty straightforward:
log 'Starting DB test'
concurrent_db = Concurrent::Array.new(@db)
futures = []
progress_bar = initialize_progress_bar('Running DB test', concurrent_db.size)
MAXIMUM_FUTURES.times do
log "Started new future, total #{futures.size} futures"
futures << Concurrent::Future.execute do
my_piece = nil
run = 1
until concurrent_db.empty?
my_piece = concurrent_db.slice!(0, SLICE_PER_FUTURE)
break unless my_piece
log "Run #{run}, sliced #{my_piece.size} pieces, #{concurrent_db.size} left"
my_piece.each {|a| run_single_test(a)}
progress_bar.progress += my_piece.size
run += 1
end
log 'Future finished'
end
end
Than I rented a large AWS Instance with 74 CPU cores and thought: "now I gonna proccess it fast". But the problem is, that no matter how many futures/threads (and I mean 20 or 1000) I launch simultaneously I am not reaching over ~50 checks/second.
When I launch 1000 threads the CPU load keeps at 100% only for 20-30 minutes and than goes down till it reaches somewhat of 15% and it stays so. Graph of typical CPU load within such a run. Disk load is not an issue, I am hitting 3MiB/s at maximum, using Amazon EBS storage.
What am I missing? Why can't I utilize 100% cpu and achieve better perfomance?
It's hard to say why exactly you aren't seeing the benefits of multithreading. But here's my guess.
Let's say you have a really intensive Ruby method that takes 10 seconds to run called do_work
. And, even worse, you need to run this method 100 times. Rather than wait 1000 seconds, you might try to multithread it. That could divide the work among your CPU cores, halving or maybe even quartering the runtime:
Array.new(100) { Thread.new { do_work } }.each(&:join)
But no, this is probably still going to take 1000 seconds to finish. Why?
Consider this example:
thread1 = Thread.new { class Foo; end; Foo.new }
thread2 = Thread.new { class Foo; end; Foo.new }
Creating a class in Ruby does a lot of stuff under the hood, for example it has to create an actual class object and assign that object's pointer to a global constant (in some order). What happens if thread1 registers that global constant, gets half way through creating the actual class object and then thread2 starts running, says "Oh, Foo
already exists. Let's go ahead and run Foo.new
". What happens since the class hasn't been fully defined? Or what if both thread1 and thread2 create a new class object and then both try to register their class as Foo
? Which one wins? What about the class object that was created and now doesn't get registered?
The official Ruby solution for this is simple: don't actually run this code in parallel. Instead, there is one single, massive mutex called "the global VM lock" that protects anything that modifies the Ruby VM's state (such as making a class). So while the two threads above may be interleaved in various ways, it's impossible for the VM to end up in an invalid state because each VM operation is essentially atomic.
This takes about 6 seconds to run on my laptop:
def do_work
Array.new(100000000) { |i| i * i }
end
This takes about 18 seconds, obviously
3.times { do_work }
But, this also takes about 18, because the GVL prevents the threads from actually running in parallel
Array.new(3) { Thread.new { do_work } }.each(&:join)
This also takes 6 seconds to run
def do_work2
sleep 6
end
But now this also takes about 6 seconds to run:
Array.new(3) { Thread.new { do_work2 } }.each(&:join)
Why? If you dig through the Ruby source code, you find that sleep
ultimately calls the C function native_sleep
and in there we see
GVL_UNLOCK_BEGIN(th);
{
//...
}
GVL_UNLOCK_END(th);
The Ruby devs know that sleep
doesn't affect the VM state, so they explicitly unlocked the GVL to allow it to run in parallel. It can be tricky to figure out exactly what locks/unlocks the GVL and when you're going to see the performance benefit of it.
My guess is that something in your code is hitting the GVL so while some parts of your threads are running in parallel (generally any subprocess/PTY stuff does), there's still contention between them in the Ruby VM causing some parts to serialize.
Your best bet with getting truly parallel Ruby code is to simplify it to something like this:
Array.new(x) { Thread.new { do_work } }
where you're sure that do_work
is something simple that definitely unlocks the GVL, such as spawning a subprocess. You could try moving your Truecrypt code into a little shell script so that Ruby doesn't have to interact with it anymore once it gets going.
I recommend starting with a little benchmark that just starts a few subprocesses, and make sure that they are actually running in parallel by comparing the time to running them serially.