Why is access to static RAM dramatically slower than access to cache in C++?

Background

I have been looking into potentially using the MPC5200 static RAM space as scratch pad memory. We have 16 KB of unused memory that appears on the processor bus (source).

Now some important implementation notes are:

This memory is used by the BestComm DMA controller. Under RTEMS, this essentially sets up a task table at the start of SRAM with a set of 16 tasks that can run as buffers for the peripheral interfaces (I2C, Ethernet, etc.). To use this space without conflict, and knowing that our system only uses about 2 KB of Ethernet driver buffers, I offset the start of SRAM by 8 KB, so we now have 8 KB of memory that we know won't be used by the system.
RTEMS defines an array that points to static memory as follows:

(source)

typedef struct {
    ...
    ...
    volatile uint8_t    sram[0x4000];
 } mpc5200_t;

 extern volatile mpc5200_t mpc5200;

And I know that the sram array points to static memory because when I edit its first section and print out the memory block at MBAR + 0x8000 (source), the change shows up there.

So from here I can say the following: I have the RTEMS-defined access to the SRAM via mpc5200.sram[0 -> 0x2000]. This means I can start testing the speed I can get out of it.
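As a rough illustration of keeping scratch accesses away from the BestComm area, here is a minimal sketch of a bounds-checked view over part of the SRAM array. The class name ScratchPad and the reserved/total parameters are made up for this example and are not part of RTEMS; which part of the 16 KB is actually free depends on how the task table was relocated, so the split is left to the caller. On the target the first argument would be mpc5200.sram; on a host an ordinary array can stand in for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: a window onto the part of the SRAM that the
// system does not use. 'reserved' bytes at the start are skipped.
class ScratchPad {
public:
    ScratchPad(volatile std::uint8_t* sram, std::size_t reserved, std::size_t total)
        : base_(sram + reserved), size_(total - reserved) {
        assert(sram != nullptr && reserved <= total);
    }

    std::size_t size() const { return size_; }

    // Byte write with a bounds check so we never run past the SRAM.
    void write(std::size_t index, std::uint8_t value) {
        assert(index < size_);
        base_[index] = value;
    }

    // Byte read with the same bounds check.
    std::uint8_t read(std::size_t index) const {
        assert(index < size_);
        return base_[index];
    }

private:
    volatile std::uint8_t* base_;  // first usable byte
    std::size_t size_;             // number of usable bytes
};
```

For example, ScratchPad pad(mpc5200.sram, 0x2000, 0x4000) would expose the upper 8 KB, assuming that is the half left free.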

Test

In order to evaluate the speed, I set up the following test:

int a; // Global that is separate from the test. 

// TEST

// Set up the data.
const unsigned int listSize = 0x1000;
uint8_t data1[listSize];
for (int k = 0; k < listSize; ++k) {
    data1[k] = k;
    mpc5200.sram[k] = k;
}

// Test 1, data on regular stack.
clock_t start = clock();
for (int x = 0; x < 5000; ++x) {
    for (int y = 0; y < 0x2000; ++y) {
        a = (data1[y]);
    }
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed dynamic: %f\n" ,elapsedTime);

// Test 2, get data from the static memory.
start = clock();
for (int x = 0; x < 5000; ++x) {
    for (int y = 0; y < 0x2000; ++y) {
        a = (mpc5200.sram[y]);
    }
}
elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed static: %f\n" ,elapsedTime);

Pretty simple: we iterate over the available space and set a global. We would expect the static memory to take approximately the same time.

RESULT

So we get the following:

elapsedDynamic = 1.415
elapsedStatic = 6.348

So there is something going on here, because the static RAM is about 4.5x slower than the cache.
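To put the two timings on a per-access basis, here is a minimal sketch of the arithmetic. The function names nsPerAccess and slowdown are made up for illustration; the loop counts (5000 outer, 0x2000 inner) and the elapsed times are the ones from the test above. The per-iteration figure includes loop overhead and the store to the global, so it is an upper bound on the memory access cost rather than a direct measurement of it.

```cpp
#include <cassert>
#include <cstdint>

// Average time per inner-loop iteration, in nanoseconds.
double nsPerAccess(double elapsedSeconds, std::uint64_t outer, std::uint64_t inner) {
    assert(outer > 0 && inner > 0);
    return elapsedSeconds * 1e9 / static_cast<double>(outer * inner);
}

// How many times slower the 'slow' run is than the 'fast' run.
double slowdown(double slowSeconds, double fastSeconds) {
    assert(fastSeconds > 0.0);
    return slowSeconds / fastSeconds;
}
```

With the numbers above, nsPerAccess(1.415, 5000, 0x2000) comes out at roughly 34.5 ns for the stack array and nsPerAccess(6.348, 5000, 0x2000) at roughly 155 ns for the SRAM, and slowdown(6.348, 1.415) is about 4.5.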

Hypothesis

So I had 3 ideas about why this is:

Cache misses. I thought that maybe something strange happens because we are mixing dynamic and static RAM, so I tried this test:

// Some pointers to use as incrementers
uint8_t *i = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+1);
uint8_t *j = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+2);
uint8_t *b = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+3);


// I replaced all of the potential memory accesses with the static ram
// variables. That way the tests have no interaction in terms of 
// memory locations. 
start = clock();
// Test 2, get data from the static memory.
for ((*i) = 0; (*i) < 240; ++(*i)) {
    for ((*j) = 0; (*j) < 240; ++(*j)) {
        (*b) = (mpc5200.sram[(*j)]);
    }
}
elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed static: %f\n" ,elapsedTime);

We have the following results:

elapsedDynamic = 0.0010
elapsedStatic = 0.2010

So now it is 200 times slower? So I guess it has nothing to do with that?

Static memory is different from normal memory. The next thing I thought was that maybe it doesn't behave how I thought it would, because of this passage:

MPC5200 contains 16KBytes of on-chip SRAM. This memory is directly accessible by the BestComm DMA unit. It is used primarily as storage for task table and buffer descriptors used by BestComm DMA to move peripheral data to and from SDRAM or other locations. These descriptors must be downloaded to the SRAM at boot. This SRAM resides in the MPC5200 internal register space and is also accessible by the processor core. As such it can be used for other purposes, such as scratch pad storage. The 16kBytes SRAM starts at location MBAR + 0x8000.

(source)

I am not sure how to confirm or deny this?

Slower static clock. Perhaps the static memory runs on a slower clock, like in some systems?

This can be disproved by looking in the manual:

[image: excerpt from the manual]

(source)

The SRAM and the processor are on the same clock; the XLB_CLK runs at the Processor Fundamental Frequency (source).

QUESTION

What could be causing this? Are there reasons in general not to use SRAM for scratch pad storage? I know this would not even be considered on modern processors, but this is an older embedded processor and we are struggling for speed and space.

EXTRA TESTS

So after the comments below I performed some extra tests:

Add volatile to the stack member to see if the speeds are more equal:

elapsedDynamic = 0.98
elapsedStatic = 5.97

So the stack array is still much faster, and adding volatile makes no real difference?

Disassemble the code to see what is happening

// original code
int a = 0;
uint8_t data5[0x2000];
void assemblyFunction(void) {
    int * test = (int*) 0xF0008000;
    mpc5200.sram[0] = a;
    data5[0] = a;
    test[0] = a;
}

void assemblyFunction(void) {
// I think this is to load up A
0:  3d 20 00 00     lis     r9,0
8:  80 09 00 00     lwz     r0,0(r9)
 14:    54 0a 06 3e     clrlwi  r10,r0,24

    mpc5200.sram[0] = a;   
    1c: 3d 60 00 00     lis     r11,0
  20:   39 6b 00 00     addi    r11,r11,0
  28:   3d 6b 00 01     addis   r11,r11,1 // Where do these come from?
  2c:   99 4b 80 00     stb     r10,-32768(r11)

test[0] = a;
   c:   3d 20 f0 00     lis     r9,-4096 // This should be the same as above??
  10:   61 29 80 00     ori     r9,r9,32768
  24:   90 09 00 00     stw     r0,0(r9)

    data5[0] = a;
    4:  3d 60 00 00     lis     r11,0
    18: 99 4b 00 00     stb     r10,0(r11)

I am not particularly good at interpreting assembler, but perhaps we have a problem here? Accessing and setting the memory from a global does seem to take more instructions for the SRAM?
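The address arithmetic in the listing above can be checked by hand. PowerPC D-form stores such as stb and stw take a signed 16-bit displacement, so an offset of 0x8000 cannot be encoded directly; adding 0x10000 with addis and then using a displacement of -32768 reaches the same byte. Below is a minimal sketch that emulates just the immediate instructions from the listing. It assumes the zero operands of lis/addi are unresolved relocations for the address of mpc5200 (the listing looks like an unlinked object file) and that this address is 0xF0000000, as used in the pointer tests; the helper names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// lis rD,imm: put imm in the upper 16 bits, clear the lower 16.
std::uint32_t lis(std::uint16_t imm) { return static_cast<std::uint32_t>(imm) << 16; }

// addis rD,rA,imm: add (imm << 16) to rA.
std::uint32_t addis(std::uint32_t ra, std::uint16_t imm) { return ra + lis(imm); }

// addi rD,rA,imm: add the sign-extended 16-bit immediate to rA.
std::uint32_t addi(std::uint32_t ra, std::int16_t imm) {
    return ra + static_cast<std::uint32_t>(static_cast<std::int32_t>(imm));
}

// ori rD,rA,imm: OR rA with the zero-extended 16-bit immediate.
std::uint32_t ori(std::uint32_t ra, std::uint16_t imm) { return ra | imm; }

// D-form effective address: rA plus a sign-extended 16-bit displacement.
std::uint32_t effectiveAddress(std::uint32_t ra, std::int16_t disp) { return addi(ra, disp); }

// mpc5200.sram[0] = a;  (hi/lo are the assumed relocated halves of &mpc5200)
std::uint32_t sramStoreAddress(std::uint16_t hi, std::int16_t lo) {
    std::uint32_t r11 = lis(hi);            // lis   r11,hi
    r11 = addi(r11, lo);                    // addi  r11,r11,lo
    r11 = addis(r11, 1);                    // addis r11,r11,1  -> +0x10000
    return effectiveAddress(r11, -32768);   // stb   r10,-32768(r11)
}

// test[0] = a;  (lis r9,-4096 is 0xF000 when read as an unsigned 16-bit value)
std::uint32_t pointerStoreAddress() {
    std::uint32_t r9 = lis(0xF000);         // lis r9,-4096
    r9 = ori(r9, 0x8000);                   // ori r9,r9,32768
    return effectiveAddress(r9, 0);         // stw r0,0(r9)
}

void checkListing() {
    // Under the stated assumptions both stores hit the first SRAM byte.
    assert(sramStoreAddress(0xF000, 0) == pointerStoreAddress());
}
```

Under those assumptions both sequences resolve to 0xF0008000, so the extra instructions are address setup outside the store itself and do not point at a different location.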

From the above test it seems that there are fewer instructions for the pointer, so I added this:

uint8_t *p = (uint8_t*)0xF0008000;

// Test 3, get data from static with direct pointer.
for (int x = 0; x < 5000; ++x) {
    for (int y = 0; y < 0x2000; ++y) {
        a = (p[y]);
    }
}

And I get the following result:

elapsed dynamic: 0.952750
elapsed static: 5.160250
elapsed pointer: 5.642125

So the pointer takes EVEN LONGER! I would have thought it would be exactly the same? This is just getting stranger.

Solution

So it looks like there are a couple of factors that might lead to this.

I am no longer sure that the SRAM is running at the same clock speed as the Processor. As pointed out by @user58697, the SRAM is on IPB clock time even though it looks like the bus is on XLB time. On top of that there is this diagram:

[image: clock diagram] (source)

This seems to indicate that the memory clock is on the XLB path but that the XLB path is at a lower frequency than the CORE clock. This can be confirmed here:

[image: XLB bus and core clock rates]

(source)

Which indicates that the XLB_Bus runs at a slower rate than the processor.

To test that the SRAM is at least faster than the dynamic RAM, I conducted the following test:

// Fill up the cache with pointless stuff
for (int i = 0; i < 4097; ++i) {
    a = (int)TSin[i];
}

// 1. Test the dynamic RAM access with a cache miss every time. 
ticks = timer_now();
// += 16 to ensure a cache line miss.
for (int y = 0; y < listSize; y += 16) {
    a = (data1[y]);
}
elapsedTicks = timer_now() - ticks;

// Fill up the cache with pointless stuff again ...

ticks = timer_now();
// Test 2, do the same cycles but with static memory.
for (int y = 0; y < listSize; y += 16) {
    a = (mpc5200.sram[y]);
}
elapsedTicks = timer_now() - ticks;

And with this we get the following results:

elapsed dynamic:  294.84 uS
elapsed static:  57.78 uS

So what we can say here is that the static RAM is faster than the dynamic RAM when every access misses the cache (as expected). But once the dynamic RAM contents are loaded into cache, accessing the static RAM is much slower, because cache access happens at processor speed and the static RAM runs at a much lower speed.
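For anyone wanting to repeat this kind of comparison, here is a minimal sketch of a strided-read timing helper. It is not the code used above: timer_now() is replaced with std::chrono::steady_clock as an assumption (it may or may not be available or precise enough on a given RTEMS target), and the names ReadResult and timedStridedRead are made up for this example. The checksum is returned so the reads cannot be optimized away and so the loop can be sanity-checked independently of the timing.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

struct ReadResult {
    std::uint32_t checksum;   // sum of all bytes read
    std::size_t accesses;     // number of reads performed
    double microseconds;      // wall-clock time for the loop
};

// Read every 'stride'-th byte of a region and time the loop.
ReadResult timedStridedRead(const volatile std::uint8_t* base,
                            std::size_t size, std::size_t stride) {
    assert(base != nullptr && stride > 0);
    ReadResult result{0, 0, 0.0};
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t y = 0; y < size; y += stride) {
        result.checksum += base[y];   // volatile read, always performed
        ++result.accesses;
    }
    const std::chrono::duration<double, std::micro> elapsed =
        std::chrono::steady_clock::now() - start;
    result.microseconds = elapsed.count();
    return result;
}
```

Calling it once with data1 and once with mpc5200.sram, with the same size and stride and with the cache flushed or evicted before each call, would mirror the two measurements above.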