AES-NI seems to be optimized to encrypt/decrypt big chunks of data. However, I'm trying to recover a password and I have many very small inputs to try (IV + first CBC block, 32 bytes in total).
I'm using OpenSSL at the moment, calling EVP_DecryptInit_ex and EVP_DecryptUpdate for every cycle (and EVP_CIPHER_CTX_init once per thread).
I can do this around 2 million times per second on a single core.
I assume this is the sort of performance I can expect using AES-NI instructions and I shouldn't worry about optimising this further. Is this correct?
Does anyone have any idea how much faster this might be on a high end GPU or not-too-expensive FPGA?
FPGA: You can convert an input block to an output block on any reasonable FPGA at a throughput of one block every 2 cycles at several hundred MHz, with a latency of 16 cycles. So, possibly 256 Mblocks/s per core when pipelined, or maybe 32 Mblocks/s when not pipelined (both figures correspond to a clock of about 512 MHz). You could fit maybe 5 of these cores on a reasonably cheap FPGA, or 30+ on an expensive one. YMMV.
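The arithmetic behind those figures can be sketched as follows. The 512 MHz clock is an assumption inferred from the numbers above (256 M x 2 cycles and 32 M x 16 cycles both give 512 M cycles/s), and the function names are hypothetical. The estimate covers block throughput only and ignores any per-key setup cost.

```c
#include <stdint.h>

/* Blocks per second for one core that accepts a new block every
 * `cycles_per_block` clock cycles.
 * Pipelined: cycles_per_block = 2 (the throughput interval).
 * Not pipelined: cycles_per_block = 16 (the full latency). */
uint64_t blocks_per_second(uint64_t clock_hz, uint64_t cycles_per_block)
{
    return clock_hz / cycles_per_block;
}

/* Aggregate rate for `cores` identical cores running in parallel. */
uint64_t aggregate_blocks_per_second(uint64_t clock_hz,
                                     uint64_t cycles_per_block,
                                     uint64_t cores)
{
    return cores * blocks_per_second(clock_hz, cycles_per_block);
}
```

Under these assumptions, 5 pipelined cores give 5 x 256 M = 1.28 G blocks/s and 30 cores give 7.68 G blocks/s, against the roughly 2 M decryptions per second measured on a single CPU core in the question.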