I am trying to use the code shown below, which is taken from the Intel whitepaper.
My aim is to perform 256-bit block encryption using AES-NI.
I have successfully derived the key schedule using the key-expansion function provided in the Intel AES-NI library: iEncExpandKey256(key,expandedKey);
and the expandedKey works fine in my non-AES-NI implementation of AES.
However, when I pass the values into Rijndael256_encrypt(testVector,testResult,expandedKey,32,1);
I get the error "Attempting to access protected memory and this usually indicates that the memory is corrupt". The line of code that causes it is data1 = _mm_xor_si128(data1, KS[0]); /* round 0 (initial xor) */
in the function shown below.
So my question is: what could cause such an error? My current hypothesis is that data1 and KS[0] could be of different sizes, and I am still verifying that. Other than that, I'm not sure where else to look. I would greatly appreciate it if someone could point me in the right direction to troubleshoot this error.
#include <wmmintrin.h>
#include <emmintrin.h>
#include <smmintrin.h>
void Rijndael256_encrypt (unsigned char *in,
                          unsigned char *out,
                          unsigned char *Key_Schedule,
                          unsigned long long length,
                          int number_of_rounds)
{
    __m128i tmp1, tmp2, data1, data2;
    __m128i RIJNDAEL256_MASK =
        _mm_set_epi32(0x03020d0c, 0x0f0e0908, 0x0b0a0504, 0x07060100);
    __m128i BLEND_MASK =
        _mm_set_epi32(0x80000000, 0x80800000, 0x80800000, 0x80808000);
    __m128i *KS = (__m128i*)Key_Schedule;
    int i, j;
    for (i = 0; i < length/32; i++) { /* loop over the data blocks */
        data1 = _mm_loadu_si128(&((__m128i*)in)[i*2+0]); /* load data block */
        data2 = _mm_loadu_si128(&((__m128i*)in)[i*2+1]);
        data1 = _mm_xor_si128(data1, KS[0]); /* round 0 (initial xor) */
        data2 = _mm_xor_si128(data2, KS[1]);
        /* Do number_of_rounds-1 AES rounds */
        for (j = 1; j < number_of_rounds; j++) {
            /* Blend to compensate for the shift rows shifts bytes between two
               128 bit blocks */
            tmp1 = _mm_blendv_epi8(data1, data2, BLEND_MASK);
            tmp2 = _mm_blendv_epi8(data2, data1, BLEND_MASK);
            /* Shuffle that compensates for the additional shift in rows 3 and 4
               as opposed to rijndael128 (AES) */
            tmp1 = _mm_shuffle_epi8(tmp1, RIJNDAEL256_MASK);
            tmp2 = _mm_shuffle_epi8(tmp2, RIJNDAEL256_MASK);
            /* This is the encryption step that includes sub bytes, shift rows,
               mix columns, xor with round key */
            data1 = _mm_aesenc_si128(tmp1, KS[j*2]);
            data2 = _mm_aesenc_si128(tmp2, KS[j*2+1]);
        }
        tmp1 = _mm_blendv_epi8(data1, data2, BLEND_MASK);
        tmp2 = _mm_blendv_epi8(data2, data1, BLEND_MASK);
        tmp1 = _mm_shuffle_epi8(tmp1, RIJNDAEL256_MASK);
        tmp2 = _mm_shuffle_epi8(tmp2, RIJNDAEL256_MASK);
        tmp1 = _mm_aesenclast_si128(tmp1, KS[j*2+0]); /* last AES round */
        tmp2 = _mm_aesenclast_si128(tmp2, KS[j*2+1]);
        _mm_storeu_si128(&((__m128i*)out)[i*2+0], tmp1);
        _mm_storeu_si128(&((__m128i*)out)[i*2+1], tmp2);
    }
}
You have:
UCHAR* Key_Schedule=Key_schedule+4;
This unaligns Key_Schedule, since Key_schedule is (I hope!) aligned and you've added 32 bits to it.
You're asking the CPU to do something that the hardware is not capable of doing because of the way the data lines are wired. This is a gross oversimplification, but: you can think of the CPU as having sixteen 8-bit slots that it has to read from. To read data, it sends out an address, which is the byte address divided by 16, and then decides which slots to read from. If the byte addresses of the 16 bytes that make up the 128-bit value don't all give the same result when divided by 16, then it's not possible to read the 16 bytes into the 16 slots.
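To check this for a particular pointer, here is a minimal sketch in plain C. The names `is_aligned_16`, `chunk_of`, `base`, and `alignment_demo` are made up for illustration and are not part of the Intel code; `base` merely stands in for an aligned key schedule.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary, else 0. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: index of the 16-byte chunk that a byte address falls in. */
uintptr_t chunk_of(const void *p)
{
    return (uintptr_t)p / 16;
}

/* Stand-in for an aligned key schedule (C11 _Alignas forces 16-byte alignment). */
_Alignas(16) unsigned char base[64];

void alignment_demo(void)
{
    /* The aligned buffer starts on a 16-byte boundary... */
    assert(is_aligned_16(base));
    /* ...but adding 4 bytes (32 bits) moves the pointer off that boundary. */
    assert(!is_aligned_16(base + 4));
    /* A 16-byte read at base stays inside one chunk. */
    assert(chunk_of(base) == chunk_of(base + 15));
    /* A 16-byte read at base + 4 covers base+4 .. base+19 and straddles two chunks. */
    assert(chunk_of(base + 4) != chunk_of(base + 4 + 15));
}
```

Calling `is_aligned_16` on the pointer just before the failing line would confirm or rule out the alignment hypothesis.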
If you don't want to impose alignment requirements on all the parameters to the function, then you'll need to have the function itself copy them into aligned buffers.
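A minimal sketch of such a copy, assuming C11 is available. The type `aligned_key_schedule`, the function `copy_key_schedule`, and the capacity `MAX_KS_BYTES` are all hypothetical names chosen for this example, and the capacity is an arbitrary upper bound rather than a value from the whitepaper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_KS_BYTES 512  /* assumed upper bound for this sketch only */

/* Hypothetical wrapper type: the _Alignas member makes every object of
   this type start on a 16-byte boundary. */
typedef struct {
    _Alignas(16) unsigned char bytes[MAX_KS_BYTES];
} aligned_key_schedule;

/* Copies ks_len bytes from a possibly unaligned source into the aligned
   buffer. Returns 0 on success, -1 if the schedule does not fit. */
int copy_key_schedule(aligned_key_schedule *dst,
                      const unsigned char *src, size_t ks_len)
{
    if (ks_len > MAX_KS_BYTES)
        return -1;
    memcpy(dst->bytes, src, ks_len);
    return 0;
}

void copy_demo(void)
{
    unsigned char raw[64 + 4];
    aligned_key_schedule ks;
    int k;

    for (k = 0; k < 68; k++)
        raw[k] = (unsigned char)k;

    /* raw + 4 mimics the offset pointer from the question. */
    assert(copy_key_schedule(&ks, raw + 4, 64) == 0);
    assert(((uintptr_t)ks.bytes % 16) == 0);
    assert(ks.bytes[0] == 4 && ks.bytes[63] == 67);
}
```

The caller would then pass `ks.bytes` as the Key_Schedule argument, at the cost of one extra copy per key.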
SSE operations need to be aligned to 16 for loading and storing[.] -- AES Intrinsics
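The alignment requirement applies to the aligned load and store forms. The question's function already reads its input with `_mm_loadu_si128`, the SSE2 load that has no alignment requirement, so one possible alternative to copying is to fetch the round keys the same way instead of dereferencing `KS[0]` directly. The sketch below illustrates only the initial XOR step; `xor_round_key` and `xor_demo` are made-up names, and this is an assumption about how the code could be adapted, not something the whitepaper does.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_storeu_si128, _mm_xor_si128 */

/* Hypothetical helper: XOR one 16-byte block with one 16-byte round key.
   None of the three pointers needs to be 16-byte aligned. */
void xor_round_key(const unsigned char *block,
                   const unsigned char *round_key,
                   unsigned char *out)
{
    __m128i data = _mm_loadu_si128((const __m128i *)block);
    __m128i key  = _mm_loadu_si128((const __m128i *)round_key);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(data, key));
}

void xor_demo(void)
{
    unsigned char block[17], key[20], out[19];
    int k;

    for (k = 0; k < 16; k++) {
        block[1 + k] = (unsigned char)k;
        key[4 + k] = 0xFF;
    }
    /* Deliberately odd offsets: at most one of these can be 16-byte aligned. */
    xor_round_key(block + 1, key + 4, out + 3);
    for (k = 0; k < 16; k++)
        assert(out[3 + k] == (unsigned char)(k ^ 0xFF));
}
```

Unaligned loads can be slower than aligned ones on some processors, so aligning the key schedule once is still a reasonable choice when it is reused for many blocks.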