I am writing RC4 for the DCPU-16, but I have some questions before I begin.
RC4 algorithm:
    //KSA
    for i from 0 to 255
        S[i] := i
    endfor
    j := 0
    for i from 0 to 255
        j := (j + S[i] + key[i mod keylength]) mod 256
        swap values of S[i] and S[j]
    endfor
    //PRGA
    i := 0
    j := 0
    while GeneratingOutput:
        i := (i + 1) mod 256
        j := (j + S[i]) mod 256
        swap values of S[i] and S[j]
        K := S[(S[i] + S[j]) mod 256]
        output K
    endwhile
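For reference, the pseudocode above can be written as a short byte-oriented Python sketch. This is only a minimal illustration for checking another implementation against; the function names rc4_ksa and rc4_keystream are made up here and are not part of any library.

```python
def rc4_ksa(key):
    """Key-scheduling algorithm: build the 256-entry permutation S."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s


def rc4_keystream(key, n):
    """Run the PRGA for n steps and return n keystream bytes."""
    s = rc4_ksa(key)
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def rc4_xor(key, data):
    """Encrypt or decrypt by XORing data with the keystream."""
    stream = rc4_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, stream))
```

With the commonly published test vector (key "Key", plaintext "Plaintext"), rc4_xor(b"Key", b"Plaintext").hex() should give bbf316e8d940af0ad3, which is a handy check for a port to another machine.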
Because I am working with 16-bit words, each element of S[] can range from 0 to 65535 instead of the expected 0-255, and K needs to be 0-65535. What would be the best approach to deal with this problem?
The options I see (and their problems) are:
1. Mod 256 everywhere and populate the output with two rounds concatenated (will take longer to run and I want to keep my CPB as low as possible).
2. K will be a 16-bit number while still using an array of length 256 for S[] (I want to do the crypto right, so I am concerned about making mistakes tinkering with RC4).
What is my best option? I feel that I may have to do #1, but I am hoping people here can instill confidence for me to do #3.
Option 2 will make the encryption weaker.
You can do this:
    loop: add i, 1              ; 2 cycles
          and i, 0xff           ; 2 cycles (& 0xff is the same as % 256)
          add j, [i+arr]        ; 3 cycles
          and j, 0xff           ; 3 cycles
          set o, [j+arr]        ; 2 cycles (overflow register used as swap variable)
          set [j+arr], [i+arr]  ; 3 cycles
          set [i+arr], o        ; 2 cycles
          set a, [i+arr]        ; 2 cycles (calculate index)
          add a, [j+arr]        ; 3 cycles
          and a, 0xff           ; 3 cycles
          set b, [a+arr]        ; 2 cycles
          ; second octet
          add i, 1
          and i, 0xff
          add j, [i+arr]
          and j, 0xff
          set o, [j+arr]
          set [j+arr], [i+arr]
          set [i+arr], o
          set a, [i+arr]
          add a, [j+arr]
          and a, 0xff
          shl b, 8
          bor b, [a+arr]
          ; output b
          set pc, loop
This is about as fast as you can make it (57 cycles per 16-bit word, unless I missed something). This assumes that S is static (the arr value in my code) and that i and j are stored in registers (you can store them before/after S when you are outside of the code).
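To check the assembly against a known-good implementation, the same two-octets-per-word loop can be modelled in Python. This is only a sketch: rc4_words is a made-up name, and it assumes the first octet ends up in the high byte of the word, as the shl/bor pair in the assembly does.

```python
def rc4_words(key, count):
    """Return `count` 16-bit keystream words, two RC4 octets per word."""
    # Standard KSA; "& 0xFF" plays the role of "mod 256".
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    i = j = 0
    words = []
    for _ in range(count):
        b = 0
        for _octet in range(2):
            i = (i + 1) & 0xFF
            j = (j + s[i]) & 0xFF
            s[i], s[j] = s[j], s[i]          # swap S[i] and S[j]
            a = (s[i] + s[j]) & 0xFF         # calculate index
            b = ((b << 8) | s[a]) & 0xFFFF   # shl b,8 then bor b,S[a]
        words.append(b)
    return words
```

For the key "Key", the byte keystream starts eb 9f 77 81, so rc4_words(b"Key", 2) should return [0xEB9F, 0x7781]. Each element of S still only ever holds a value from 0 to 255, so the cipher itself is unmodified RC4.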
Trying to pack the array will make everything slower, as you need to unpack it each time.
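To illustrate where the extra work comes from, here is a rough Python sketch of a packed S with two octets per 16-bit word. The helper names packed_get and packed_set and the layout (even index in the high byte) are assumptions chosen for illustration only.

```python
def packed_get(words, k):
    """Read octet k from an array that stores two octets per 16-bit word."""
    w = words[k >> 1]            # which word holds the octet
    if k & 1 == 0:
        return (w >> 8) & 0xFF   # even index: high byte
    return w & 0xFF              # odd index: low byte


def packed_set(words, k, value):
    """Write octet k; this needs a read-modify-write of the whole word."""
    idx = k >> 1
    if k & 1 == 0:
        words[idx] = (words[idx] & 0x00FF) | ((value & 0xFF) << 8)
    else:
        words[idx] = (words[idx] & 0xFF00) | (value & 0xFF)


# Pack the identity permutation into 128 words instead of 256.
packed_s = [0] * 128
for k in range(256):
    packed_set(packed_s, k, k)
```

Every access to S now costs an index shift, a parity test and a shift or mask, and every write also has to preserve the other half of the word. The unpacked version needs a single memory operand for the same access, so packing saves 128 words of memory at the price of a slower inner loop.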