I have been looking into using the MPC5200 static RAM space as scratch pad memory. We have 16 KB of unused memory that appears on the processor bus (source).
Some important implementation notes:
This memory is used by the BestComm DMA controller. Under RTEMS, this essentially sets up a task table at the start of SRAM with a set of 16 tasks that can run as buffers for peripheral interfaces (I2C, Ethernet, etc.). In order to use this space without conflict, and knowing that our system only uses about 2 KB of Ethernet driver buffers, I offset the start of SRAM by 8 KB, so now we have 8 KB of memory that we know won't be used by the system.
RTEMS defines an array that points to static memory as follows (source):
typedef struct {
    ...
    ...
    volatile uint8_t sram[0x4000];
} mpc5200_t;
extern volatile mpc5200_t mpc5200;
And I know that the sram array points to static memory because when I edit the first section and print out the memory block (MBAR + 0x8000, source), the change shows up there.
So from here I can say the following: I have the RTEMS-defined access to the SRAM via mpc5200.sram[0 -> 0x2000]. This means I can start doing some testing on the speed I can get out of it.
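To make the layout concrete, here is a minimal sketch of the address arithmetic. It assumes MBAR is 0xF0000000 (the value the raw-pointer code later in this post uses) and that the free window is the first 0x2000 bytes of SRAM, as the mpc5200.sram[0 -> 0x2000] range suggests. The names kMbar, inScratch and sramIndex are made up for illustration and are not part of RTEMS.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: MBAR is 0xF0000000, as in the raw-pointer tests in this post.
constexpr std::uint32_t kMbar        = 0xF0000000u;
constexpr std::uint32_t kSramOffset  = 0x8000u;  // SRAM starts at MBAR + 0x8000
constexpr std::uint32_t kSramSize    = 0x4000u;  // size of mpc5200.sram (16 KB)
constexpr std::uint32_t kScratchSize = 0x2000u;  // assumed free window (8 KB)

constexpr std::uint32_t kSramBase = kMbar + kSramOffset;

// Hypothetical helper: true if [addr, addr + len) lies inside the free window.
constexpr bool inScratch(std::uint32_t addr, std::uint32_t len) {
    return addr >= kSramBase && len <= kScratchSize &&
           addr - kSramBase <= kScratchSize - len;
}

// Hypothetical helper: index into mpc5200.sram for an absolute address.
inline std::uint32_t sramIndex(std::uint32_t addr) {
    assert(inScratch(addr, 1));
    return addr - kSramBase;
}

static_assert(kSramBase == 0xF0008000u, "SRAM base is MBAR + 0x8000");
static_assert(kScratchSize <= kSramSize, "scratch window fits inside SRAM");
```

For example, the address 0xF0000000+0x8000+0x1000+1 used further down maps to index 0x1001, which is inside the assumed 8 KB window.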
In order to evaluate the speed, I set up the following test:
int a; // Global that is separate from the test.
**TEST**
// Set up the data.
const unsigned int listSize = 0x1000;
uint8_t data1[listSize];
for (int k = 0; k < listSize; ++k) {
    data1[k] = k;
    mpc5200.sram[k] = k;
}
// Test 1, data on regular stack.
clock_t start = clock();
for (int x = 0; x < 5000; ++x) {
    for (int y = 0; y < 0x2000; ++y) {
        a = (data1[y]);
    }
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed dynamic: %f\n", elapsedTime);
// Test 2, get data from the static memory.
start = clock();
for (int x = 0; x < 5000; ++x) {
    for (int y = 0; y < 0x2000; ++y) {
        a = (mpc5200.sram[y]);
    }
}
elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed static: %f\n", elapsedTime);
Pretty simple: the concept is that we iterate over the available space and assign each byte to a global. I expected the static memory to take approximately the same time as the stack array.
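One thing worth keeping in mind with this kind of loop: mpc5200.sram is declared volatile while data1 and a are not, so the compiler may not treat the two loops the same way. A minimal sketch of a helper that forces every read in both cases by going through a volatile pointer is below. ReadTiming and timeReads are hypothetical names, and this is not the code that produced the numbers in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ctime>

struct ReadTiming {
    double seconds;     // CPU time reported by clock()
    std::uint32_t sum;  // checksum of all bytes read
};

// Hypothetical helper: read n bytes through a volatile pointer, reps times.
inline ReadTiming timeReads(const volatile std::uint8_t* p, std::size_t n, int reps) {
    assert(p != nullptr);
    std::uint32_t sum = 0;
    const std::clock_t start = std::clock();
    for (int r = 0; r < reps; ++r) {
        for (std::size_t i = 0; i < n; ++i) {
            sum += p[i];  // volatile access: performed on every iteration
        }
    }
    const double s = static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
    return ReadTiming{s, sum};
}
```

Calling timeReads(data1, listSize, 5000) and timeReads(mpc5200.sram, listSize, 5000) would then run the same loop over both memories, with only the address differing.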
So we get the following:
elapsedDynamic = 1.415
elapsedStatic = 6.348
So there is something going on here, because the static RAM is roughly 4.5x slower than the cached stack data.
So I had 3 ideas about why this is:
// Some pointers to use as incrementers
uint8_t *i = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+1);
uint8_t *j = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+2);
uint8_t *b = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+3);
// I replaced all of the potential memory accesses with the static ram
// variables. That way the tests have no interaction in terms of
// memory locations.
start = clock();
// Test 2, get data from the static memory.
for ((*i) = 0; (*i) < 240; ++(*i)) {
    for ((*j) = 0; (*j) < 240; ++(*j)) {
        (*b) = (mpc5200.sram[(*j)]);
    }
}
elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed static: %f\n", elapsedTime);
We have the following results:
elapsedDynamic = 0.0010
elapsedStatic = 0.2010
So now it is 200 times slower? So I guess it is not to do with that.
Static memory different to normal: the next thing I thought was that maybe it doesn't behave how I thought it would because of this passage:
MPC5200 contains 16KBytes of on-chip SRAM. This memory is directly accessible by the BestComm DMA unit. It is used primarily as storage for task table and buffer descriptors used by BestComm DMA to move peripheral data to and from SDRAM or other locations. These descriptors must be downloaded to the SRAM at boot. This SRAM resides in the MPC5200 internal register space and is also accessible by the processor core. As such it can be used for other purposes, such as scratch pad storage. The 16kBytes SRAM starts at location MBAR + 0x8000.
(source)
I am not sure how to confirm or deny this.
This can be disproved by looking in the manual (source):
The SRAM and the processor are on the same clock: the XLB_CLK runs at the Processor Fundamental Frequency (source).
What could be causing this? Are there reasons in general not to use SRAM for scratch pad storage? I know on modern processors this would not even be considered, but this is an older embedded processor and we are struggling for speed and space.
So after the comments below I performed some extra tests:
Adding volatile to the stack member to see if the speeds are more equal:
elapsedDynamic = 0.98
elapsedStatic = 5.97
So the stack array is still much faster, and volatile makes no real difference.
// original code
int a = 0;
uint8_t data5[0x2000];
void assemblyFunction(void) {
    int * test = (int*) 0xF0008000;
    mpc5200.sram[0] = a;
    data5[0] = a;
    test[0] = a;
}
void assemblyFunction(void) {
// I think this is to load up A
0: 3d 20 00 00 lis r9,0
8: 80 09 00 00 lwz r0,0(r9)
14: 54 0a 06 3e clrlwi r10,r0,24
mpc5200.sram[0] = a;
1c: 3d 60 00 00 lis r11,0
20: 39 6b 00 00 addi r11,r11,0
28: 3d 6b 00 01 addis r11,r11,1 // Where do these come from?
2c: 99 4b 80 00 stb r10,-32768(r11)
test[0] = a;
c: 3d 20 f0 00 lis r9,-4096 // This should be the same as above??
10: 61 29 80 00 ori r9,r9,32768
24: 90 09 00 00 stw r0,0(r9)
data5[0] = a;
4: 3d 60 00 00 lis r11,0
18: 99 4b 00 00 stb r10,0(r11)
I am not particularly good at interpreting assembler, but perhaps we have a problem here? Accessing and setting the memory from a global does seem to take more instructions for the SRAM.
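On the "Where do these come from?" question, my reading of the listing is this. The lis r11,0 / addi r11,r11,0 pair presumably loads the address of mpc5200 (the zeros look like unresolved relocations in an unlinked object file). The sram member sits 0x8000 bytes into the struct, and a PowerPC load/store displacement is a signed 16-bit value, so +0x8000 does not fit. The compiler therefore adds 0x10000 with addis and stores at displacement -32768, which lands on the same address that lis r9,-4096 / ori r9,r9,32768 builds directly. The sketch below just checks that arithmetic; the function names mirror the mnemonics and the base address 0xF0000000 is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// lis rD, imm: place the 16-bit immediate in the upper half of the register.
constexpr std::uint32_t lis(std::uint16_t imm) {
    return static_cast<std::uint32_t>(imm) << 16;
}

// addis rD, rA, imm: add (imm << 16) to rA, wrapping at 32 bits.
constexpr std::uint32_t addis(std::uint32_t ra, std::uint16_t imm) {
    return ra + (static_cast<std::uint32_t>(imm) << 16);
}

// Sign-extend the 16-bit displacement field of a load/store.
constexpr std::uint32_t signExtend16(std::uint16_t d) {
    return (d & 0x8000u) ? (0xFFFF0000u | d) : d;
}

// Effective address of stb/stw d(rA): rA plus the sign-extended displacement.
constexpr std::uint32_t effectiveAddress(std::uint32_t ra, std::uint16_t d) {
    return ra + signExtend16(d);
}

// Assumption: mpc5200 is linked at MBAR = 0xF0000000.
static_assert(effectiveAddress(addis(0xF0000000u, 1), 0x8000) == 0xF0008000u,
              "addis 1 then stb -32768 reaches base + 0x8000");
static_assert((lis(0xF000) | 0x8000u) == 0xF0008000u,
              "lis -4096 then ori 32768 builds 0xF0008000");
```

If that reading is right, the struct access costs one extra addis, which on its own seems far too small to explain a several-fold slowdown.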
uint8_t *p = (uint8_t*)0xF0008000;
// Test 3, get data from static with direct pointer.
for (int x = 0; x < 5000; ++x) {
    for (int y = 0; y < 0x2000; ++y) {
        a = (p[y]);
    }
}
And I get the following result:
elapsed dynamic: 0.952750
elapsed static: 5.160250
elapsed pointer: 5.642125
So the pointer takes EVEN LONGER! I would have thought it would be exactly the same. This is just getting stranger.
So it looks like there are a couple of factors that might lead to this.
The SRAM is on IPB clock time even though it looks like the bus is on XLB time. On top of that there is this diagram (source):
This seems to indicate that the memory clock is on the XLB path but that the XLB path is at a lower frequency than the CORE clock. This can be confirmed here (source), which indicates that the XLB_Bus runs at a slower rate than the processor.
// Fill up the cache with pointless stuff
for (int i = 0; i < 4097; ++i) {
    a = (int)TSin[i];
}
// 1. Test the dynamic RAM access with a cache miss every time.
ticks = timer_now();
// += 16 to ensure a cache line miss.
for (int y = 0; y < listSize; y += 16) {
    a = (data1[y]);
}
elapsedTicks = timer_now() - ticks;
// Fill up the cache with pointless stuff again ...
ticks = timer_now();
// Test 2, do the same cycles but with static memory.
for (int y = 0; y < listSize; y += 16) {
    a = (mpc5200.sram[y]);
}
elapsedTicks = timer_now() - ticks;
And with this we get the following results:
elapsed dynamic: 294.84 uS
elapsed static: 57.78 uS
So what we can say here is that the static RAM is faster than the dynamic RAM (expected). But once the dynamic RAM data is loaded into cache, accessing the static RAM is much slower by comparison, because cache access happens at processor speed and the static RAM runs well below that.
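To put the totals on a common footing, here is a small sketch that turns them into an average time per access. It uses only the loop bounds and timings quoted in this post; nsPerAccess is a made-up helper name, and the averages include loop and timer overhead, so treat them as rough figures.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: average time per access in nanoseconds.
constexpr double nsPerAccess(double totalSeconds, std::uint64_t accesses) {
    return totalSeconds * 1e9 / static_cast<double>(accesses);
}

// First test: 5000 outer passes, 0x2000 reads each.
constexpr std::uint64_t kCachedAccesses = 5000ull * 0x2000ull;  // 40,960,000

// Cache-flush test: listSize (0x1000) bytes visited in steps of 16.
constexpr std::uint64_t kStridedAccesses = 0x1000ull / 16ull;   // 256

// Timings as reported above (seconds).
constexpr double kCachedDynamic  = 1.415;
constexpr double kCachedStatic   = 6.348;
constexpr double kFlushedDynamic = 294.84e-6;
constexpr double kFlushedStatic  = 57.78e-6;
```

With these inputs the first test works out to roughly 34.5 ns per read from the stack array against roughly 155 ns per read from SRAM, and the cache-flush test to roughly 1152 ns per read from dynamic RAM against roughly 226 ns per read from SRAM.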