Background
I have been looking into using the MPC5200 static RAM (SRAM) as scratch pad memory. We have 16 KB of unused memory that appears on the processor bus (source).
Some important implementation notes:
This memory is used by the BestComm DMA controller. Under RTEMS, this sets up a task table at the start of SRAM with a set of 16 tasks that can run as buffers for peripheral interfaces (I2C, Ethernet, etc.). To use this space without conflict, and knowing that our system only uses about 2 KB of Ethernet driver buffers, I offset the start of SRAM by 8 KB, so we now have 8 KB of memory that we know won't be used by the system.
RTEMS defines an array that points to static memory as follows (source):
typedef struct {
...
...
volatile uint8_t sram[0x4000];
} mpc5200_t;
extern volatile mpc5200_t mpc5200;
I know that the sram array points to static memory because when I edit the first section and print out the memory block at MBAR + 0x8000 (source), the edited values show up there.
So I have RTEMS-defined access to the SRAM via mpc5200.sram[0 -> 0x2000], which means I can start testing the speed I can get out of it.
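A minimal sketch of the layout claim, assuming a simplified stand-in for the RTEMS struct: mock_mpc5200_t and sramAddress are hypothetical names, and the regs array collapses every real register block into one opaque region. It only shows how the sram member ends up at MBAR + 0x8000 and is not the RTEMS definition.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the RTEMS mpc5200_t. The real struct spells out
// every register block; here they are collapsed into one opaque array.
struct mock_mpc5200_t {
    volatile uint8_t regs[0x8000];  // assumed: everything that precedes the SRAM
    volatile uint8_t sram[0x4000];  // 16 KB on-chip SRAM
};

// Compile-time check that sram[] begins 0x8000 bytes into the register map.
static_assert(offsetof(mock_mpc5200_t, sram) == 0x8000,
              "sram[] does not start at MBAR + 0x8000");

// Bus address of sram[index] for a given MBAR value.
constexpr uint32_t sramAddress(uint32_t mbar, uint32_t index) {
    return mbar + static_cast<uint32_t>(offsetof(mock_mpc5200_t, sram)) + index;
}
```

With the MBAR value of 0xF0000000 used in the pointer tests in this post, sramAddress(0xF0000000, 0) evaluates to 0xF0008000.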
Test
In order to evaluate the speed, I set up the following test:
int a; // Global that is separate from the test.
// TEST
// Set up the data.
const unsigned int listSize = 0x1000;
uint8_t data1[listSize];
for (int k = 0; k < listSize; ++k) {
data1[k] = k;
mpc5200.sram[k] = k;
}
// Test 1, data on regular stack.
clock_t start = clock();
for (int x = 0; x < 5000; ++x) {
for (int y = 0; y < 0x2000; ++y) {
a = (data1[y]);
}
}
double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed dynamic: %f\n" ,elapsedTime);
// Test 2, get data from the static memory.
start = clock();
for (int x = 0; x < 5000; ++x) {
for (int y = 0; y < 0x2000; ++y) {
a = (mpc5200.sram[y]);
}
}
elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed static: %f\n" ,elapsedTime);
The concept is simple: we iterate over the available space and copy each byte into a global. I expected reads from the static memory to take approximately the same time as reads from the stack array.
RESULT
So we get the following:
elapsedDynamic = 1.415
elapsedStatic = 6.348
So there is something going on here, because the static memory is about 4.5x slower than the regular stack memory.
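Working the numbers through, assuming both timings cover the same 5000 x 0x2000 reads from the test above: the ratio is 6.348 / 1.415, or roughly 4.5x, and the average cost per read comes out at about 35 ns for the stack array and about 155 ns for the SRAM. The helper names below are hypothetical, and the per-read figures include loop overhead, so they are upper bounds rather than pure memory latencies.

```cpp
#include <cstdio>

// Loop bounds from the test: 5000 outer passes over 0x2000 bytes.
constexpr double kAccesses = 5000.0 * 0x2000;  // 40,960,000 reads per test

// Average time per read in nanoseconds for an elapsed time in seconds.
constexpr double nsPerAccess(double elapsedSeconds) {
    return elapsedSeconds * 1e9 / kAccesses;
}

// How many times slower one measurement is than another.
constexpr double slowdown(double slowSeconds, double fastSeconds) {
    return slowSeconds / fastSeconds;
}

// Prints the derived figures for the two timings reported above.
void printSummary() {
    std::printf("stack: %.1f ns/read\n", nsPerAccess(1.415));
    std::printf("sram:  %.1f ns/read\n", nsPerAccess(6.348));
    std::printf("ratio: %.2fx\n", slowdown(6.348, 1.415));
}
```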
Hypothesis
I had 3 ideas about why this is:
- Cache misses. I thought that mixing dynamic and static RAM might be causing something strange, so I tried this test:
// Some pointers to use as incrementers
uint8_t *i = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+1);
uint8_t *j = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+2);
uint8_t *b = reinterpret_cast<uint8_t*>(0xF0000000+0x8000+0x1000+3);
// I replaced all of the potential memory accesses with the static ram
// variables. That way the tests have no interaction in terms of
// memory locations.
start = clock();
// Test 2, get data from the static memory.
for ((*i) = 0; (*i) < 240; ++(*i)) {
for ((*j) = 0; (*j) < 240; ++(*j)) {
(*b) = (mpc5200.sram[(*j)]);
}
}
elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed static: %f\n" ,elapsedTime);
We have the following results:
elapsedDynamic = 0.0010
elapsedStatic = 0.2010
So now it is 200 times slower? I guess it has nothing to do with that.
- Static memory is different from normal memory. The next thing I thought was that maybe it doesn't behave how I expected because of this passage:
MPC5200 contains 16KBytes of on-chip SRAM. This memory is directly accessible by the BestComm DMA unit. It is used primarily as storage for task table and buffer descriptors used by BestComm DMA to move peripheral data to and from SDRAM or other locations. These descriptors must be downloaded to the SRAM at boot. This SRAM resides in the MPC5200 internal register space and is also accessible by the processor core. As such it can be used for other purposes, such as scratch pad storage. The 16kBytes SRAM starts at location MBAR + 0x8000.
(source)
I am not sure how to confirm or rule this out.
- Slower static clock. Perhaps the static memory runs on a slower clock, as in some systems?
This can be disproved by looking in the manual (source):
The SRAM and the processor are on the same clock: the XLB_CLK runs at the Processor Fundamental Frequency (source).
QUESTION
What could be causing this? Are there reasons in general not to use SRAM for scratch pad storage? I know this would not even be considered on modern processors, but this is an older embedded processor and we are struggling for speed and space.
EXTRA TESTS
After the comments below, I performed some extra tests:
- Add volatile to the stack array to see if the speeds become more equal:
elapsedDynamic = 0.98
elapsedStatic = 5.97
So the stack array is still much faster, and volatile made no real difference.
- Disassemble the code to see what is happening:
// original code
int a = 0;
uint8_t data5[0x2000];
void assemblyFunction(void) {
int * test = (int*) 0xF0008000;
mpc5200.sram[0] = a;
data5[0] = a;
test[0] = a;
}
void assemblyFunction(void) {
// I think this is to load up A
0: 3d 20 00 00 lis r9,0
8: 80 09 00 00 lwz r0,0(r9)
14: 54 0a 06 3e clrlwi r10,r0,24
mpc5200.sram[0] = a;
1c: 3d 60 00 00 lis r11,0
20: 39 6b 00 00 addi r11,r11,0
28: 3d 6b 00 01 addis r11,r11,1 // Where do these come from?
2c: 99 4b 80 00 stb r10,-32768(r11)
test[0] = a;
c: 3d 20 f0 00 lis r9,-4096 // This should be the same as above??
10: 61 29 80 00 ori r9,r9,32768
24: 90 09 00 00 stw r0,0(r9)
data5[0] = a;
4: 3d 60 00 00 lis r11,0
18: 99 4b 00 00 stb r10,0(r11)
I am not particularly good at interpreting assembler, but perhaps we have a problem here? Accessing and setting the memory from a global does seem to take more instructions for the SRAM.
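A sketch of the address arithmetic in that listing, written as small C++ models of the PowerPC instructions (the function names mirror the mnemonics but are otherwise hypothetical). The displacement in a store such as stb is a signed 16-bit value, so it cannot hold +0x8000 directly. Adding 0x10000 with addis and then storing at -32768 reaches base + 0x8000, the offset of sram inside the struct. The lis r11,0 / addi r11,r11,0 pair is assumed to be an unrelocated placeholder for the address of mpc5200, since the listing starts at offset 0 and looks like an unlinked object file.

```cpp
#include <stdint.h>

// lis rD,SIMM: place the 16-bit immediate in the upper half, lower half zero.
constexpr uint32_t lis(int16_t simm) {
    return static_cast<uint32_t>(static_cast<uint16_t>(simm)) << 16;
}

// ori rA,rS,UIMM: OR in a zero-extended 16-bit immediate.
constexpr uint32_t ori(uint32_t rs, uint16_t uimm) {
    return rs | uimm;
}

// addis rD,rA,SIMM: add the immediate shifted left by 16 bits.
constexpr uint32_t addis(uint32_t ra, int16_t simm) {
    return ra + lis(simm);
}

// Effective address of a D-form store such as stb rS,d(rA):
// rA plus the sign-extended 16-bit displacement (wraps modulo 2^32).
constexpr uint32_t effectiveAddress(uint32_t ra, int16_t d) {
    return ra + static_cast<uint32_t>(static_cast<int32_t>(d));
}
```

Under those assumptions, ori(lis(-4096), 32768) and effectiveAddress(addis(0xF0000000, 1), -32768) both give 0xF0008000, so the two sequences target the same location. The extra instructions are address setup rather than extra memory accesses.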
- From the above test it seems that there are fewer instructions for the pointer, so I added this:
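uint8_t *p = (uint8_t*)0xF0008000;
// Test 3, get data from static with direct pointer.
start = clock();
for (int x = 0; x < 5000; ++x) {
for (int y = 0; y < 0x2000; ++y) {
a = (p[y]);
}
}
elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
printf("elapsed pointer: %f\n" ,elapsedTime);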
And I get the following result:
elapsed dynamic: 0.952750
elapsed static: 5.160250
elapsed pointer: 5.642125
So the pointer takes EVEN LONGER! I would have thought it would be exactly the same? This is just getting stranger.
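One way to make the three measurements above directly comparable is to push every buffer through the same volatile-qualified read path, so the compiler has to issue each load whether the data lives on the stack or in SRAM. This is a sketch only: ReadTiming and timeReads are hypothetical names, and it keeps the clock() timing used in the tests above.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <time.h>

struct ReadTiming {
    double seconds;     // elapsed CPU time as reported by clock()
    uint32_t checksum;  // sum of all bytes read, so the loop has a visible result
};

// Reads len bytes from buf, passes times, through a volatile-qualified
// pointer so every load is actually performed regardless of where buf lives.
ReadTiming timeReads(const volatile uint8_t* buf, size_t len, int passes) {
    uint32_t sum = 0;
    const clock_t start = clock();
    for (int p = 0; p < passes; ++p) {
        for (size_t i = 0; i < len; ++i) {
            sum += buf[i];
        }
    }
    const double seconds = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    return {seconds, sum};
}
```

On the target this would be called as timeReads(data1, 0x2000, 5000), timeReads(mpc5200.sram, 0x2000, 5000) and timeReads(p, 0x2000, 5000). Any gap that remains between the results then comes from the memory itself rather than from how the compiler treated each access.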
[...] RTEMS's implementation, so I'm not too keen to play around with it. I believe so though, since it can indeed be edited by devices etc. (not on our system but in RTEMS in general). – Whitish
[...] volatile then benchmark again. I think the compiler has to be super paranoid about accesses there, it can't rely on registers, and it slows down considerably. – Hillard
[...] volatile to the stack memory component, then it should take the same amount of time? – Whitish
[...] volatile to your stack data1 is the right test. – Drew
IPB clock can run at the same frequency as XLB clock, or 1/2 the frequency. BestComm runs at the IPB clock frequency as do all IPB control register access logic. So from this I'm not sure, I think it probably runs at the same rate as XLB. – Whitish