
Writing RC4 for a 16-bit system


I am writing RC4 for the DCPU-16, but I have some questions before I begin.

RC4 algorithm:

//KSA
for i from 0 to 255
    S[i] := i
endfor
j := 0
for i from 0 to 255
    j := (j + S[i] + key[i mod keylength]) mod 256
    swap values of S[i] and S[j]
endfor

//PRGA
i := 0
j := 0
while GeneratingOutput:
    i := (i + 1) mod 256
    j := (j + S[i]) mod 256
    swap values of S[i] and S[j]
    K := S[(S[i] + S[j]) mod 256]
    output K
endwhile
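For reference, the pseudocode above can be transcribed almost line for line into Python. This is a minimal sketch meant as a byte-oriented baseline to compare a DCPU-16 port against; the function names `ksa`, `prga`, and `rc4_keystream` are illustrative and not part of any library.

```python
def ksa(key: bytes) -> list[int]:
    """Key-scheduling algorithm: build the 256-entry permutation S."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s


def prga(s: list[int], count: int) -> bytes:
    """Pseudo-random generation algorithm: emit `count` keystream bytes."""
    s = list(s)  # work on a copy so the caller's state is untouched
    i = j = 0
    out = []
    for _ in range(count):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def rc4_keystream(key: bytes, count: int) -> bytes:
    """Convenience wrapper: KSA followed by PRGA."""
    return prga(ksa(key), count)


if __name__ == "__main__":
    print(rc4_keystream(b"Key", 10).hex())
```

With the widely published test key `Key`, the first ten keystream bytes are `eb9f7781b734ca72a719`, which gives a fixed target to check any port against.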

Since I am working with 16-bit words, each element of S[] can range from 0 to 65535 instead of the expected 0-255, and K needs to be in the range 0-65535. What would be the best approach to deal with this problem?

The options I see (and their problems) are:

  1. Still use mod 256 everywhere and populate the output with two rounds concatenated (this will take longer to run, and I want to keep my CPB, cycles per byte, as low as possible)
  2. Tweak RC4 so K will be a 16-bit number while still using an array of length 256 for S[] (I want to do the crypto right, so I am concerned about making mistakes while tinkering with RC4)

What is my best option? I feel that I may have to do #1, but I am hoping people here can give me the confidence to do #2.


Solution

  • Option 2 will make the encryption weaker.

    You can do option 1 (two standard RC4 rounds per 16-bit word) like this:

    loop: add i,1           ; 2 cycles
    and i,0xff              ; & 0xff is the same as % 256; 2 cycles
    add j,[i+arr]           ; 3 cycles
    and j,0xff              ; 2 cycles
    set o,[j+arr]           ; using overflow reg as swap var; 2 cycles
    set [j+arr],[i+arr]     ; 3 cycles
    set [i+arr],o           ; 2 cycles
    set a,[i+arr]           ; calc index; 2 cycles
    add a,[j+arr]           ; 3 cycles
    and a,0xff              ; 2 cycles
    set b,[a+arr]           ; first octet; 2 cycles
    
    ; second octet
    add i,1
    and i,0xff
    add j,[i+arr]
    and j,0xff
    set o,[j+arr]
    set [j+arr],[i+arr]
    set [i+arr],o
    set a,[i+arr]
    add a,[j+arr]
    and a,0xff
    shl b,8                 ; move the first octet to the high byte
    bor b,[a+arr]           ; second octet goes in the low byte
    ; output b
    set pc,loop
    

    This is about as fast as you can make it (about 54 cycles per 16-bit word going by the DCPU-16 1.1 instruction timings, unless I missed something). It assumes that S is static (the arr value in my code) and that i and j are stored in registers (you can store them before or after S in memory when you are outside of this code).
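To check an implementation of the loop above, it can help to have a small reference model on a desktop machine. The following Python sketch mirrors the same steps: each S entry stays in 0..255, `& 0xFF` stands in for `% 256`, and two consecutive RC4 output octets are combined into one 16-bit word with the first octet in the high byte. The function name `rc4_words` is illustrative, and the sketch includes a standard KSA, which the assembly above assumes has already run.

```python
def rc4_words(key: bytes, count: int) -> list[int]:
    """Return `count` 16-bit keystream words, two RC4 octets per word."""
    # KSA: standard key schedule; every S entry stays in 0..255 even though
    # it would occupy a full 16-bit word in DCPU-16 memory.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # PRGA: two ordinary RC4 rounds per output word, as in the assembly loop.
    i = j = 0
    words = []
    for _ in range(count):
        b = 0
        for _ in range(2):
            i = (i + 1) & 0xFF             # add i,1 / and i,0xff
            j = (j + s[i]) & 0xFF          # add j,[i+arr] / and j,0xff
            s[i], s[j] = s[j], s[i]        # swap via a temporary
            a = (s[i] + s[j]) & 0xFF       # index of the output octet
            b = ((b << 8) | s[a]) & 0xFFFF # shl b,8 / bor b,[a+arr]
        words.append(b)
    return words


if __name__ == "__main__":
    print([hex(w) for w in rc4_words(b"Key", 5)])
```

For the well-known test key `Key`, standard RC4 produces the keystream bytes `eb 9f 77 81 b7 34 ca 72 a7 19`, so the first five words should be `0xeb9f, 0x7781, 0xb734, 0xca72, 0xa719`. Because the octets are unmodified RC4 output, the result stays compatible with any byte-oriented RC4 implementation.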

    Trying to pack the array (two S entries per 16-bit word) will make everything slower, as you need to unpack it on every access.
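To illustrate where the extra work comes from, here is a small Python sketch of what packed access might look like. The layout (even index in the low byte, odd index in the high byte) and the helper names `pack_state`, `packed_get`, and `packed_set` are assumptions made for illustration only.

```python
def pack_state(s: list[int]) -> list[int]:
    """Pack 256 byte-sized entries into 128 16-bit words (assumed layout)."""
    return [s[k] | (s[k + 1] << 8) for k in range(0, 256, 2)]


def packed_get(words: list[int], idx: int) -> int:
    """Read entry `idx`: one word load plus a conditional shift and a mask."""
    w = words[idx >> 1]
    return (w >> 8) & 0xFF if idx & 1 else w & 0xFF


def packed_set(words: list[int], idx: int, value: int) -> None:
    """Write entry `idx`: load, mask out the old half, merge, store."""
    k = idx >> 1
    if idx & 1:
        words[k] = (words[k] & 0x00FF) | ((value & 0xFF) << 8)
    else:
        words[k] = (words[k] & 0xFF00) | (value & 0xFF)


if __name__ == "__main__":
    words = pack_state(list(range(256)))
    packed_set(words, 5, 0xAB)
    print(packed_get(words, 4), packed_get(words, 5))
```

Every read needs a shift, a test on the low index bit, and a mask, and every write becomes a read-modify-write. Each RC4 round touches S several times, so this overhead is paid repeatedly in exchange for saving 128 words of memory.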